Abstract


Integrated circuit technology has been advancing at a phenomenal rate over the 
last several years, and promises to continue to do so. If circuit design is to keep pace 
with fabrication technology, radically new approaches to computer-aided design will 
be necessary. One appealing approach is general purpose parallel processing. This 
thesis explores the issues involved in developing a framework for circuit simulation 
which exploits the locality exhibited by circuit operation to achieve a high degree of 
parallelism. This framework maps the topology of the circuit onto the multiprocessor, 
assigning the simulation of individual partitions to separate processors. A new form of 
synchronization is developed, based upon a history maintenance and roll back strategy. 
The circuit simulator PRSIM was designed and implemented to determine the efficacy 
of this approach. The results of several preliminary experiments are reported, along 
with an analysis of the behavior of PRSIM. 
Chapter I


Introduction 


An important component of any design process is a mechanism for incrementally 
checking the validity of design decisions and the interactions among those decisions. 
There must be a feedback path from the partially completed design back to the de- 
signer, allowing the designer to find and correct mistakes before fabrication. In modern 
digital circuit design, this feedback path is often provided by computer-aided simula- 
tion. However, in recent years integrated circuit technology has been advancing very 
rapidly. It is now possible to build chips containing more than 500,000 transistors. 
The current generation of simulation tools is already stretched to the limit, and will 
soon prove incapable of meeting this increase in demand. If circuit design is to keep 
pace with technology, radically new approaches to simulation will be necessary. One 
promising approach is to depart from the von Neumann style of computation and take 
advantage of recent advances in the field of parallel processing to build fast, scalable
simulation tools.


1.1. Overview 


In digital circuit design, the feedback path from a partially completed design back
to the designer is typically provided by computer-aided simulation. Historically, there
have been two general approaches to circuit simulation: analytical and functional.
Analytical simulators, such as SPICE, use detailed, non-linear models of circuit components
drawn from fundamental physical principles, and solve the resulting set of ordinary 
differential equations using sparse matrix methods [12]. Because of this level of detail, 
analytical simulators tend to be computationally expensive, and so are limited in prac- 
tice to the simulation of relatively small circuits (a few tens or hundreds of transistors). 
More recently, a number of algorithms have been developed to substantially improve the 
performance of circuit analysis programs. These include table lookup methods, such 
as those used in MOTIS [5], and iterated relaxation methods, such as those employed 
by SPLICE [18] and RELAX [13]. Although these newer techniques offer more than an 
order of magnitude performance improvement over the sparse matrix approach, they
still cannot economically simulate one entire chip.


At the opposite end of the spectrum from circuit analysis are functional simula- 
tors, such as LAMP [4] and MOSSIM [3], which combine very simple models of circuit 
components, e.g., gates or switches, with efficient event based simulation algorithms. 
This class of simulation tool is very useful for determining logical correctness, but offers 
no timing information. In the past few years, a third approach has emerged which tries 
to find a middle ground between analytical and functional simulation. Examples of 
this approach include the timing analyzers CRYSTAL [14] and TV [9], and the circuit 
simulator RSIM [19]. Each of these tools uses simple linear models of the electrical char- 
acteristics of the components to predict the timing behavior of a circuit. These tools 
permit one to obtain timing information on circuits of tens of thousands of devices, at 
the expense of some accuracy. Unfortunately, they are also reaching the limits of their
capacities.


There are several approaches to solving the problem of capacity limitations. The 
first, and most obvious, solution is to vectorize the old algorithms to run on faster
machines, such as the Cray and the CDC Cyber. The second approach is to develop
new, faster algorithms, such as the relaxation based schemes mentioned earlier. Another
approach which has gained favor in certain circles is the development of special
purpose hardware which is capable of running one specific algorithm very fast. Exam- 
ples of this approach are the simulation pipeline of Abramovici [1], and the Yorktown 
Simulation Engine, developed by IBM [15]. Unfortunately, these solutions tend to be 
very expensive and applicable to only a very limited class of problems. 

General purpose parallel processing offers several advantages over these other
approaches.


• Scalability: Simulation algorithms can be developed which are independent
of the number of processors in the system. As the size of the circuit
grows, the number of processors, and hence the performance of the simulation,
can grow.

• Flexibility: The machine architecture is not tuned for one particular
algorithm. Therefore, the same physical hardware can be pressed into
service for a wide range of applications, extending the utility of the machine.

• Portability: The parallel algorithms developed need not be constrained
to a particular machine architecture. Therefore, the same algorithms can
be run on a wide variety of parallel systems, extending the utility of the
algorithms.

This thesis explores the issues involved in developing a framework for circuit simu- 
lation which can utilize the advantages offered by general purpose parallel computation. 
The approach is based upon the observation that the locality of digital circuit opera- 
tion, and the resulting independence of separate subcircuits, leads very naturally to a 
high degree of parallelism. The framework developed in this thesis attempts to reflect
the inherent parallelism of the circuit in the structure of the simulator.


1.2. Chapter Outline 


Chapter 2 presents a novel approach to digital circuit simulation. This chapter
begins by exploring the techniques for mapping the circuit under simulation onto the
topology of a general purpose multiprocessor. The synchronization problems imposed 
by the resulting precedence constraints are then examined, and a unique solution based 
upon history maintenance and roll back is proposed. The problem of partitioning a 
circuit in a fashion conducive to this form of simulation is then addressed. Finally, 
related work in the field of parallel simulation is reviewed. 

Chapter 3 presents the implementation of the simulator Parallel RSIM, or PRSIM. 
This chapter begins with background information on the RSIM simulation algorithm and 
the Concert multiprocessor on which PRSIM is built. The overall structure of PRSIM 
is presented, with particular concentration on interprocessor communication and the 
history maintenance and roll back synchronization mechanisms. 

Chapter 4 presents experimental results obtained from PRSIM. A series of exper- 
iments were designed and run to determine the overall performance of PRSIM, and to 
develop a solid understanding of the various overhead costs in PRSIM. The results from 
these experiments are analyzed, and some conclusions are drawn. 

Chapter 5 concludes the thesis with a summary of the work reported and suggestions
for future research.




Chapter II 


Parallel Simulation 


Digital circuit operation exhibits a high degree of locality. At the device level, 
there is locality in the operation of individual transistors. Each transistor operates in 
isolation, using only the information available at its terminal nodes. At a somewhat 
higher level, there is locality in the operation of combinational logic gates. The output 
behavior of a gate is strictly a function of its input values. At a still higher level, 
there is locality in the operation of functional modules. The instruction decode unit 
of a microprocessor has no knowledge of what is transpiring in the ALU. It merely
performs some function upon its inputs to produce a set of outputs.


The locality property of circuit operation is reflected in the structure of many 
simulation algorithms. So called event based simulators exhibit a similar degree of 
locality. A switch level simulator determines the value of a node by examining the state 
of neighboring switches. This locality property of the simulation algorithm implies the 
simulation of constituent subcircuits is independent. The simulations of two logic gates
separated in space are independent over short periods of time.

This independence property has several interesting implications for the design of
parallel simulation tools. First, it promises to be unaffected by scale. The potential 
parallelism increases linearly with the size of the circuit to be simulated. Second, it 
implies homogeneity of processing. Each processor can run the same simulation code 
on its own piece of the circuit. Third, the circuit database can be distributed across the 
multiprocessor. This eliminates the potential bottleneck presented by a shared network 
database, and allows the simulator to take advantage of the natural structure of the 
circuit. 

In this chapter a framework for circuit simulation is presented which takes ad- 
vantage of the independence inherent in circuit operation to achieve a high degree of 
parallelism. The general strategy is to map the circuit onto the target multiprocessor 
such that the parallelism of the simulation reflects the parallelism of the circuit. The 
framework uses a simple message passing approach to communication. Interprocessor
synchronization is based upon a novel history maintenance and roll back mechanism.


2.1. A Framework for Parallel Simulation 


There are several desirable properties our framework should have. First, the re- 
sulting simulator must be scalable. As the number of devices in the circuits that we 
wish to simulate increases, the performance of the simulator must also increase. There- 
fore, the framework should be capable of scaling to an arbitrary number of processors. 
Second, the framework should be relatively independent of the simulation algorithm. 
We would like to be able to apply the same strategy to a wide range of tools, from low 
level MOS timing analyzers to high level architectural simulators. Third, to permit 
our scheme to run on a variety of general purpose parallel machines, we must make no 
special demands of the underlying processor architecture. In particular, to be capable 
of running on both tightly and loosely coupled multiprocessors, a simulator should im- 
pose as few restrictions as possible on the nature of the interprocessor communication 
mechanism. We would like to avoid relying upon shared memory and imposing limits
on message latencies.

The strategy we shall follow is to map the circuit to be simulated onto the topology
of the target multiprocessor. For simulation on an n processor system, the circuit to be 
simulated is first broken into n subcircuits, or partitions. Each partition is composed 
of one or more atomic units, e.g., gates or subnets. An atomic unit is the collection 
of local network information necessary for the simulation algorithm to determine the 
value of a circuit node. Each processor is then assigned the task of simulating one 
partition of the circuit. Figure 2.1 demonstrates graphically the decomposition of a
network of atomic units into two partitions.


Figure 2.1. Partitioning a Network


The straight lines crossing the partition boundaries represent communication links 
between logically adjacent atomic units which have been placed in different partitions. 
In actual circuit operation, separate components communicate via the signals carried by 
electrical connections they have in common. Similarly, in simulation adjacent atomic 
units communicate only via the values of shared nodes. Therefore, the information 
which must be passed along the communication links consists of node values only. There 
is no need to share a common network database or pass non-local network information 
between partitions. 

Communications issues tend to dominate the design of large digital circuits. Successful
designs must constrain communication between submodules to meet routing
and bandwidth requirements imposed by the technology. These constraints are similar
to those imposed by some multiprocessor architectures. Such constraints are often the
source of performance limitations in parallel processing. Because the communication 
structure of the simulation in our framework is closely related to that of the actual 
circuit, our framework can easily utilize the natural modularity and optimizations in a 
circuit design to reduce interpartition, and hence interprocessor, communication. 

In order to further reduce communication and to guarantee a consistent view of 
the state of the network across all processors, we shall enforce the restriction that 
the value of every node is determined by exactly one partition. Therefore, the links 
shown in Figure 2.1 will be unidirectional; a node may be either an input or an output 
of a partition, but never both. If more than one partition were allowed to drive a 
particular node, each partition would require information about the state of the other 
drivers to determine the correct value of the node. By eliminating the possibility of 
multiple drivers we eliminate the need for this non-local information and the extra 
communication required to arbitrate such an agreement. 
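
The single-driver restriction can be checked mechanically once a partition assignment has been chosen. The following is a minimal sketch, not part of PRSIM: the `AtomicUnit` record, the function `interpartition_links`, and the representation of an assignment as a dictionary are all assumptions made for illustration. It rejects any assignment in which a node is driven from two partitions, and otherwise returns the unidirectional links that cross partition boundaries.

```python
from collections import namedtuple

# Hypothetical record for an atomic unit: the nodes it reads and the nodes it drives.
AtomicUnit = namedtuple("AtomicUnit", ["name", "inputs", "outputs"])


def interpartition_links(units, assignment):
    """Return the set of (node, driving_partition, receiving_partition) links.

    `assignment` maps each unit name to a partition id. Raises ValueError
    if any node is driven from more than one partition.
    """
    driver = {}  # node -> partition that drives it
    for unit in units:
        part = assignment[unit.name]
        for node in unit.outputs:
            if driver.setdefault(node, part) != part:
                raise ValueError(f"node {node!r} is driven by partitions "
                                 f"{driver[node]} and {part}")
    links = set()
    for unit in units:
        part = assignment[unit.name]
        for node in unit.inputs:
            src = driver.get(node)  # None for primary inputs of the circuit
            if src is not None and src != part:
                links.add((node, src, part))
    return links


units = [
    AtomicUnit("g1", inputs=["a", "b"], outputs=["x"]),
    AtomicUnit("g2", inputs=["x", "c"], outputs=["y"]),
    AtomicUnit("g3", inputs=["y"], outputs=["z"]),
]
links = interpartition_links(units, {"g1": "A", "g2": "B", "g3": "B"})
print(sorted(links))  # [('x', 'A', 'B')]
```

In this toy network only node x crosses the boundary, so a single link from A to B carries all interpartition traffic.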

This is not as serious a restriction as it first appears. In an MOS circuit, it 
implies all nodes connected through sources or drains of transistors, such as pullup and 
pulldown chains and pass transistor logic, must reside in the same partition. Since such 
structures are the components of higher level logic gates, it makes sense to keep them 
close together. The only difficulty arises from long busses with many drivers. This case 
results in a “bit slice” style of partitioning, where all of the drivers for one bit of the 
bus reside in the same partition, but different bits may reside in separate partitions. 
Since there tends to be relatively little communication from one bit to another, this
restriction actually obeys the natural decomposition of digital circuits.


2.2. Synchronization


2.2.1 Precedence Constraints 


A node shared between two partitions represents a precedence constraint. Enforcing
this precedence constraint requires additional communication and can introduce
delay in a poorly balanced simulation. Consider the circuit in Figure 2.2. Let T(A) be
the current simulated time of partition A, and T(B) be the current simulated time of
partition B. For B to compute the value of node Y at t1, it must determine the value
of node X at t1. If at the point where B requests the value of node X, T(A) < t1 (i.e.,
A is running slower than B), the request must be blocked until T(A) ≥ t1, potentially
suspending the simulation of B. This interruption results from the need to synchronize
the simulations of partitions A and B.


Figure 2.2. Data Dependence Between Two Partitions 


The circular precedence constraint introduced by feedback between two (or more)
partitions can result in a forced synchronization of the simulations. In Figure 2.3
feedback has been introduced into the previous example by connecting node Y of
partition B to node T of A. Each gate is assumed to have a delay of τ seconds. If A
has computed the value of X at T(A) = t0, B is free to compute the value of Y at
t0 + τ. However, for A to proceed to compute the value of X at t0 + 2τ, it must wait
until T(B) ≥ t0 + τ, that is, until B has finished computing Y at t0 + τ. The feedback
has forced the two partitions into lock step, with each partition dependent upon a value
computed during the previous time step of the other.


Figure 2.3. Data Dependence With Feedback


2.2.2 Input Buffering


These synchronization problems arise from the coupling between partitions introduced
by shared nodes. With this in mind, the following observation can be made:
If all partition inputs remained constant, there would be no precedence constraints to
enforce. Each partition could be simulated independently of the others. This principle
can be used to decouple partitions by introducing a level of buffering between each partition,
as shown in Figure 2.4. Each partition maintains a buffer for each input node.
Simulation is then allowed to proceed based upon the assumption that the currently
buffered value of each input will remain valid indefinitely.


Figure 2.4. Input Buffering Between Partitions 


When a partition changes the value of an output node, it informs all other par- 
titions for which that node is an input. This is the basic form of interpartition com- 
munication. Changes in shared node values propagate from the driving partition to 
the receiving partitions. The information passed for a node change consists of a triple 
composed of the name of the node that changed, the new value of that node, and the 
simulated time the change took place. The receiving partitions use this information to
update their input buffers, and, if necessary, correct their simulations.
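
The node change triple and the per-input buffers can be sketched as follows. This is only an illustration under assumed names (`NodeChange`, `InputBuffers`) and an assumed integer encoding of node values and simulated time; it is not the PRSIM data layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeChange:
    """The triple sent from the driving partition to each receiving partition."""
    node: str    # name of the node that changed
    value: int   # new value of that node
    time: int    # simulated time at which the change took place


class InputBuffers:
    """Per-partition buffers holding the last known value of each input node."""

    def __init__(self, input_nodes, initial=0):
        self.values = {node: initial for node in input_nodes}

    def read(self, node):
        # The simulation assumes this value stays valid until told otherwise.
        return self.values[node]

    def apply(self, change):
        if change.node not in self.values:
            raise KeyError(f"{change.node!r} is not an input of this partition")
        self.values[change.node] = change.value


buffers = InputBuffers(["x"])
buffers.apply(NodeChange(node="x", value=1, time=40))
print(buffers.read("x"))  # 1
```

Only node values travel between partitions here; no network structure is shared, which matches the locality argument made above.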


2.2.3 Roll Back Synchronization 


To maintain a consistent state of the network across the multiprocessor, some form
of synchronization is necessary. In the previous example, it is possible for partition B
to get sufficiently far ahead of A that its assumption of constant inputs will result in 
incorrect simulation. Some form of correction is necessary. To this end, we employ 
a checkpointing and roll back strategy derived from the state restoration approach to 
fault tolerance in distributed systems [16] [17]. As the simulation progresses, a partition 
periodically stops what it is doing and takes a checkpoint of the current state of the 
simulation. This action is analogous to entering a recovery block in [16]. The checkpoint 
contains a record of all of the pieces of state in the partition: the value of every node, 
all pending events, and any state information kept by the simulation algorithm (e.g., 
the current simulated time). From this checkpoint, the simulation of the partition can 
be completely restored to the current state at any future time, effectively rolling the 
simulation back to the time the checkpoint was taken. The set of saved checkpoints 
forms a complete history of the simulation path from the last resynchronization up to
the current time.


When a partition receives an input change, one of two possible actions will occur. 
If the simulated time of the input change is greater than the current time, a new event 
representing the change is scheduled and simulation proceeds normally. However, if 
the simulated time of the input change is less than the current time, the simulation 
is “rolled back” to a point preceding the input change. This roll back operation is 
accomplished by looking back through the checkpoint history to find the most recent 
checkpoint taken prior to the scheduled time of the input change. The simulation state 
is then restored from that checkpoint, a new event is scheduled for the input change,
and simulation is resumed from the new simulated time.
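
The two cases above can be sketched with a toy partition that keeps a time-ordered checkpoint list. The class `Partition`, its fields, and the use of integer simulated times are assumptions made for illustration. An input change stamped exactly at the current time is treated here as schedulable, which is also an assumption, since the text only describes the strictly earlier and strictly later cases.

```python
import bisect
import copy
import heapq


class Partition:
    """Toy partition: state is a dict of node values plus a pending-event heap."""

    def __init__(self):
        self.time = 0
        self.nodes = {}
        self.events = []        # heap of (time, node, value)
        self.checkpoints = []   # list of (time, saved_state), ordered by time

    def checkpoint(self):
        state = (self.time, copy.deepcopy(self.nodes), list(self.events))
        self.checkpoints.append((self.time, state))

    def receive(self, node, value, when):
        """Handle an input change; return True if a roll back was needed."""
        rolled_back = when < self.time
        if rolled_back:
            # Most recent checkpoint taken strictly before the change time.
            times = [t for t, _ in self.checkpoints]
            idx = bisect.bisect_left(times, when) - 1
            if idx < 0:
                raise RuntimeError("no checkpoint precedes the input change")
            _, (self.time, nodes, events) = self.checkpoints[idx]
            self.nodes = copy.deepcopy(nodes)
            self.events = list(events)
            # The old path beyond the restored checkpoint is now invalid.
            del self.checkpoints[idx + 1:]
        heapq.heappush(self.events, (when, node, value))
        return rolled_back


p = Partition()
for t in (0, 10, 20, 30):
    p.time = t
    p.nodes["y"] = t
    p.checkpoint()
p.time = 35
print(p.receive("x", 1, when=25), p.time)  # True 20
print(p.receive("x", 0, when=50), p.time)  # False 20
```

The first message arrives in the partition's past and forces a restore to the checkpoint at 20; the second lies in its future and is simply queued.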


Figure 2.5 shows a partial history of the simulation of two partitions, A and B. 
The time line represents the progression of simulated time. The “X” marks represent 
the times at which checkpoints were taken. The broken vertical line indicates a node 
change directed from one partition to another. The current time of each partition is
shown by the corresponding marker.


Figure 2.5. Simulation Before Roll Back


The snapshot shows the point when partition B notifies A that the value of a shared
node changed at t2. Upon receipt of the input change message, the simulation of A is
suspended and the checkpoint history is searched for the most recent checkpoint prior
to t2. The state of A is then restored to time t1 from the appropriate checkpoint. An
event is scheduled for t2 to record the change of the input node. The old simulation path
beyond t1 is now invalid, so all checkpoints taken after t1 are thrown away. Partition
A is now completely restored to t1 and simulation may continue. Figure 2.6 shows
a snapshot of the simulation immediately following the completion of the roll back
operation.


Figure 2.6. Simulation After Roll Back of Partition A


2.2.4 Consistency Across Roll Back 


To maintain consistency across roll back, additional communication is required.
Figure 2.7 shows the interactions among three partitions. At t3 partition C notifies B
that a shared node has changed value. Since T(B) > t3, B is forced to roll back to the
most recent checkpoint prior to t3, which is at t0. The node change from C to B does
not directly affect A. However, since B will embark upon a new simulation path from
t0, the input change B sent to A at t2 will be invalid. To ensure the consistency of A, a
roll back notification message is introduced. Upon rolling back, B sends A a roll back
notification message informing it that any input changes from B more recent than t0
must be invalidated. This does not necessarily force A to roll back. If T(A) < t2, the
time of the earliest input change from B more recent than t0, A need only flush the
input change at t2. If T(A) > t2, A would be forced to roll back to a point prior to t2.


Figure 2.7. Roll Back Notification


The roll back notification procedure can be optimized if each partition maintains 
a history of output changes to implement a change retraction mechanism. At each 
time step, a partition checks the output history for the current simulated time. If, in 
a previous simulation path, an output change occurred which did not take place in the 
current path, a retraction is sent to all dependent partitions, and the output change is 
removed from the history. If the change did occur in the current path, no new change 
messages are necessary. Consider Figure 2.7. Since the change which forced B to roll
back occurred at t3, B will follow the same simulation path from t0 to t3, making the
same node change at t2. Therefore, B need not resend this change to A. A will not be
forced to roll back even if T(A) > t2.
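
The change retraction check made at each time step can be sketched as a set comparison between the output changes of the earlier simulation path and those of the current one. The function `reconcile_outputs` and the layout of `history` (a dictionary from simulated time to a set of (node, value) pairs) are assumptions made for illustration, not the PRSIM implementation.

```python
def reconcile_outputs(history, time, current_changes):
    """Compare the output changes of the current path with the earlier path.

    `history` maps simulated time -> set of (node, value) already sent.
    Returns (to_send, to_retract) and updates `history` in place.
    """
    previous = history.get(time, set())
    current = set(current_changes)
    to_send = current - previous       # new in this path
    to_retract = previous - current    # sent earlier, no longer valid
    if current:
        history[time] = current
    else:
        history.pop(time, None)
    return to_send, to_retract


# Output changes sent by a partition before it was rolled back.
history = {20: {("y", 1)}, 40: {("y", 0)}}

# Re-simulation repeats the change at time 20 but not the one at time 40.
first = reconcile_outputs(history, 20, [("y", 1)])
second = reconcile_outputs(history, 40, [])
print(first)   # (set(), set())
print(second)  # (set(), {('y', 0)})
```

A repeated change produces no message at all, so a dependent partition is disturbed only when the new path actually diverges from the old one.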

We must still address the problem of convergence in the presence of feedback. With 
the scheme outlined so far, it is possible for two partitions with a circular dependence 
to synchronize, with each partition repeatedly forcing the other to roll back. Figure 2.8 
demonstrates this problem. When B notifies A of the change at t2, A will be forced to
roll back to t0. If B progresses beyond t3 before A reaches t3, B will be forced to roll
back to t1. Once again, when B reaches t2, A will be forced back to t0, and the cycle
repeats forever.


Figure 2.8. Convergence Problem in the Presence of Feedback


If B had taken a checkpoint at t such that t2 < t < t3, it would not have forced
A to roll back, and the cycle would have been avoided. However, if the changes occur
simultaneously (t2 = t3), we are again faced with the infinite cycle. To solve this
problem, we first make the following assertion about the nature of the simulation
algorithm: the elapsed simulated time between an input change and any resulting new
events is non-zero. This assertion can be made true by proper partitioning of the
network. This restriction allows the simulation of a single time step to be sub-divided
into two distinct phases:

1. the processing of all internally generated events queued for the current
simulated time, including the propagation of output changes to other
partitions;

2. the processing of all externally generated input changes queued for the
current simulated time.


This in turn permits us to take a checkpoint between the two phases of the simulation, 
after any output changes have been made and before any input changes have been 
processed. Returning to the example of Figure 2.8, if B were to take a checkpoint at
t2, it could be rolled back safely without causing a further roll back in A, even in the
limit of t2 = t3. Forward progress is assured if we can guarantee there will always be
a checkpoint in the right place. 
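
The ordering of the two phases and the checkpoint between them can be sketched as below. The function `run_time_step`, the dictionary layout of `state`, and the `send` and `take_checkpoint` callbacks are hypothetical names chosen for illustration; real event processing would also schedule new events, which is omitted here.

```python
def run_time_step(state, time, internal, external, send, take_checkpoint):
    """Process one simulated time step in two phases.

    `internal` and `external` are lists of (node, value) queued for `time`;
    `send` and `take_checkpoint` are callbacks supplied by the caller.
    """
    # Phase 1: internally generated events, including output propagation.
    for node, value in internal:
        state["nodes"][node] = value
        if node in state["outputs"]:
            send(node, value, time)
    # Checkpoint between the phases: outputs for `time` are already sent,
    # inputs for `time` are not yet applied.
    take_checkpoint(time, dict(state["nodes"]))
    # Phase 2: externally generated input changes.
    for node, value in external:
        state["nodes"][node] = value


sent, saved = [], []
state = {"nodes": {"x": 0, "y": 0}, "outputs": {"y"}}
run_time_step(state, 20, internal=[("y", 1)], external=[("x", 1)],
              send=lambda n, v, t: sent.append((n, v, t)),
              take_checkpoint=lambda t, nodes: saved.append((t, nodes)))
print(sent)            # [('y', 1, 20)]
print(saved)           # [(20, {'x': 0, 'y': 1})]
print(state["nodes"])  # {'x': 1, 'y': 1}
```

Restoring from the saved state and applying a different input at time 20 does not repeat the output change at time 20, which is the property the convergence argument relies on.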

The convergence problem is related to the “domino effect” observed in distributed 
systems, where one failure can cause many interdependent processes to repeatedly 
roll back until they reach their initial state [16][17]. In the context of simulation we 
have shown that this problem arises from synchronization of precedence constraints 
imposed by the partitioning. Under these circumstances, the best that can be done, 
short of dynamically repartitioning to ease the constraints, is to guarantee convergence. 
This is done by subdividing the simulation of a single time step into two phases, and
checkpointing between the phases.


2.2.5 Checkpointing 


The checkpointing strategy must meet the following constraints: the checkpoint 
must contain all of the state necessary to completely restore the simulation; there must 
always be at least one consistent state to fall back to; and it must be possible to make 
forward progress in the event of unexpected synchronization. In addition to these 
constraints, there are some less important but still desirable properties a checkpoint 
strategy should have. For example, to prevent rolling back further than necessary, the 
simulation should be checkpointed frequently. In the limit, a checkpoint at every time 
step would eliminate redundant work. We would also like the checkpointing process to 
be as inexpensive in both space and time as possible. There is a tradeoff between the 
cost we are willing to pay when forced to roll back and the cost we are willing to pay 
for checkpointing overhead. 


We expect the communication between partitions in a statically well-partitioned
circuit to be clustered in time, e.g., around clock edges. This implies the probability
of receiving a node change is greatest immediately following a change, and decreases 
as the time since the last change increases. The probability of roll back should follow 
a similar pattern. Therefore, to reduce the amount of redundant simulation caused 
by rolling back, we would like to have a high density of checkpoints in the vicinity of 
communication clusters. If the dynamic balance of the partitioning is less than ideal, 
some of the partitions will simulate faster than others. In this case, the amount of 
redundant work forced upon the faster partitions by roll back is less critical, as they 
will still catch up to and overtake the slower partitions. Therefore, if the time since 
the last roll back is large, we can afford to reduce the density of checkpoints. 

These observations have led to a strategy of varying the frequency of checkpointing
with time. Following each resynchronization and each roll back, a checkpoint is
taken at every time step for the first several steps, thus ensuring forward progress as 
well as providing a high density of checkpoints. As the simulation progresses, the num- 
ber of time steps between checkpoints is increased up to some maximum period. The 
longer the simulation runs without rolling back, the lower the checkpoint density, and 
hence the overhead, becomes. We have arbitrarily chosen to use an exponential decay 
function for the frequency until we have a better model of the probability distributions
of interpartition communication.
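
One way such a schedule might look is sketched below. The text does not give the decay constants, so the parameters `dense_steps`, `growth`, and `max_interval` and their default values are purely illustrative assumptions: every step is checkpointed at first, then the gap between checkpoints grows geometrically (so the frequency decays exponentially) until it reaches a maximum period.

```python
def checkpoint_steps(total_steps, dense_steps=4, growth=2, max_interval=32):
    """Yield the step numbers (counted from the last roll back) to checkpoint.

    The first `dense_steps` checkpoints fall on consecutive steps; after that
    the gap between checkpoints is multiplied by `growth` each time, capped
    at `max_interval`. All three parameters are illustrative assumptions.
    """
    step, interval, taken = 0, 1, 0
    while step < total_steps:
        yield step
        taken += 1
        if taken >= dense_steps:
            interval = min(interval * growth, max_interval)
        step += interval


print(list(checkpoint_steps(100)))
# [0, 1, 2, 3, 5, 9, 17, 33, 65, 97]
```

Restarting the generator after every roll back or resynchronization restores the dense region, which is where the text expects further roll backs to be most likely.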


2.3. Partitioning 


The overall performance of the simulator is determined by two factors: proces- 
sor utilization, and communication costs. Both of these factors are influenced by the 
manner in which the network is partitioned. To maximize processor utilization, the 
simulation load must be evenly distributed among the processors. This implies par- 
titioning the circuit into pieces of roughly equal size and complexity. To minimize 
communication costs, the number of links between partitions should be minimized. 
There are a number of classical graph partitioning algorithms which address both of
these criteria [10][11].

For example, consider the data path block diagram shown in Figure 2.9. A static
analysis of this circuit shows most of the communication paths are horizontal, from 
left to right. Only in the carry chain of the ALU and in the shifter will there be any 
communication from bit to bit. A static min-cut algorithm would partition this circuit 
into horizontal slices, following the flow of information along each bit. One would
expect this partitioning to result in an even load balance, with little interprocessor
communication.


Figure 2.9. Data Path Floor Plan
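
The two static criteria, even load and few interpartition links, can be scored for a candidate partitioning with a few lines of code. This sketch is not one of the algorithms cited above; the function `partition_quality`, the per-unit cost estimates, and the edge-list representation are assumptions made for illustration.

```python
def partition_quality(unit_load, edges, assignment):
    """Return (load_imbalance, cut_links) for a static partitioning.

    unit_load:  dict unit -> estimated simulation cost
    edges:      iterable of (unit_a, unit_b) pairs sharing a node
    assignment: dict unit -> partition id
    """
    load = {}
    for unit, cost in unit_load.items():
        part = assignment[unit]
        load[part] = load.get(part, 0) + cost
    mean = sum(load.values()) / len(load)
    imbalance = max(load.values()) / mean   # 1.0 means perfectly balanced
    cut = sum(1 for a, b in edges if assignment[a] != assignment[b])
    return imbalance, cut


unit_load = {"g1": 1, "g2": 1, "g3": 1, "g4": 1}
chain = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
contiguous = partition_quality(unit_load, chain, {"g1": 0, "g2": 0, "g3": 1, "g4": 1})
alternating = partition_quality(unit_load, chain, {"g1": 0, "g2": 1, "g3": 0, "g4": 1})
print(contiguous)   # (1.0, 1)
print(alternating)  # (1.0, 3)
```

Both assignments balance the load equally well, but the contiguous one cuts a single link where the alternating one cuts three, which is the distinction a min-cut algorithm exploits.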


Unfortunately, there are dynamic components to both processor utilization and 
communication with which static partitioning algorithms are unable to cope. For ex- 
ample, consider a 16-bit counter to be split into 4 partitions. A static min-cut algorithm 
would divide this circuit into four 4-bit slices, in the same manner as the data path 
above. Each partition would be exactly the same size, have only one input (the carry 
in) and one output (the carry out). At first glance, this would seem to be a fine parti- 
tioning. The dynamic behavior, however, will be quite poor. Both the simulation load 
and the communication decrease exponentially from the low order partition to the high 
order one, with the low order partition doing eight times the work of the high order 
one. A more effective partitioning would have placed bit 0 of the counter (the low order 
bit) in the first partition; bits 1 and 2 in the second partition; bits 3-6 in the third; and 
bits 7-15 in the last. The dynamic load would then be much more evenly distributed. 
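
The skew in dynamic load can be illustrated by counting bit toggles over one full cycle of the counter. Using toggles as a stand-in for simulation load is a crude assumption made only for this sketch, since real cost also depends on the work done per event, so the figures it produces are not PRSIM measurements. The function names are hypothetical.

```python
def toggle_counts(bits=16):
    """Count how often each bit toggles over one full cycle of a binary counter."""
    mask = (1 << bits) - 1
    counts = [0] * bits
    for value in range(1 << bits):
        changed = value ^ ((value + 1) & mask)
        for i in range(bits):
            counts[i] += (changed >> i) & 1
    return counts


def partition_loads(counts, slices):
    """Sum toggle counts over each (low_bit, high_bit) slice, inclusive."""
    return [sum(counts[lo:hi + 1]) for lo, hi in slices]


counts = toggle_counts()
static = partition_loads(counts, [(0, 3), (4, 7), (8, 11), (12, 15)])
skewed = partition_loads(counts, [(0, 0), (1, 2), (3, 6), (7, 15)])
print(static)
print(skewed)
```

Under this proxy the equal-width slices show activity falling off sharply toward the high order partition, while the uneven slices spread the activity far more evenly across the first three partitions.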

Clearly, a partitioning strategy based only upon the static structure of the circuit
will not fare well under a wide range of applications. Some knowledge of the dynamic
behavior of the simulation is necessary. One approach would be to begin with a static
partitioning, but dynamically repartition the network during the simulation by shuffling
atomic units between processors to optimize the load balance and communication. This
topic is beyond the scope of this thesis, and deserves future investigation.


2.4. Summary 


In this chapter we have presented a framework for simulation which takes advan- 
tage of the parallelism inherent in digital circuit operation. We proposed a scheme in 
which the circuit to be simulated is partitioned onto the topology of the multiproces- 
sor, with each processor responsible for the simulation of one partition. We discussed 
the problems of synchronization introduced by this approach, and developed a solu- 
tion based upon a history maintenance and roll back mechanism. This solution was 
demonstrated to be sufficient to guarantee convergence in the presence of feedback. Fi- 
nally, we discussed the importance of good partitioning, and showed that static graph 
partitioning algorithms may not be adequate. 

We began this chapter by setting out three goals for a parallel simulation framework.
Let us now see how close our proposed framework comes to those goals.


• The framework is scalable to a large number of processors. As the size of
  the circuit grows, we can increase the number of partitions, keeping the
  average size of the partitions constant. The factors which will probably
  limit the scalability will be the interprocessor communication mechanism
  (e.g., bandwidth, congestion), and the effectiveness of the partitioning
  algorithm.

• The framework does impose some constraints upon the nature of the
  simulation algorithm. We require an event based simulator which exhibits a
  high degree of locality. A wide range of simulation tools will fit this
  description, but we exclude most low level circuit analysis programs, such
  as SPICE.

• The framework has few requirements of the underlying multiprocessor
  architecture. The small amount of communication required makes it
  suitable for both tightly and loosely coupled systems. The overall
  performance should degrade gracefully with increasing message latencies.


2.5. Related Work 


The problems of parallel simulation have received a great deal of attention recently.
A number of the resulting research efforts have influenced the work reported in
this thesis. Among the most influential have been the work on the MSPLICE parallel
simulator and the Virtual Time model for distributed processing.


2.5.1 MSPLICE 


MSPLICE is a multiprocessor implementation of a relaxation based circuit simulator 
[6]. The algorithm employed is known as Iterated Timing Analysis, and is based upon 
Newton-Raphson iteration to approximate the solution of the node equations which 
describe the circuit. It makes use of event driven, selective trace techniques similar to 
those employed by SPLICE to minimize the amount of computation required per time 
step of simulation [18]. 

The Iterated Timing Analysis method is extended for implementation on a mul- 
tiprocessor by a “data partitioning” technique. The circuit to be simulated is divided 
into sub-circuits, with each sub-circuit represented by a separate nodal admittance ma- 
trix. Each sub-circuit is then allocated to a processor. Each processor, operating on the 
same time step, applies the ITA algorithm to each of its sub-circuits until convergence 
is reached. When every sub-circuit on every processor has converged, the simulation 
advances to the next time step. Synchronization is achieved through a global variable 
which represents the count of outstanding sub-circuit events for the current time step. 

The approach to parallelism followed by MSPLICE is quite close to that of our proposed
framework. Both schemes seek to exploit the parallelism inherent in the circuit
through a data partitioning strategy: the circuit to be simulated is distributed across
the multiprocessor, with each processor running the same algorithm on different data.
There are several important differences, though. The MSPLICE algorithm is necessarily
synchronous, with all of the processors simulating the same time step. This has two 
important implications. First, the time required to simulate a particular time step is 
determined by the slowest partition. Second, additional communication is required to 
manipulate the global synchronization counter. Because of the nature of the relaxation 
method, MSPLICE does not have the same locality properties as our framework. The 
information necessary to compute the node values of a given sub-circuit is not necessar- 
ily local to a single processor. For each iteration, each processor must fetch the current 
values of all of the fanin nodes for each sub-circuit, and propagate events to all of the 
fanout nodes. The communication requirements of MSPLICE imply a dependence upon 
shared memory and a tightly coupled multiprocessor architecture, which we have tried
to avoid.


2.5.2 Virtual Time 


Virtual Time is a model for the organization of distributed systems which is based 
upon a lookahead and rollback mechanism for synchronization. In this model, processes 
coordinate their actions through an imaginary Global Virtual Clock. Messages trans- 
mitted from one process to another contain the virtual time the message is sent and 
the virtual time the message is to be received. If the local virtual time of the receiver 
is greater than the virtual time of an incoming message, the receiving process is rolled 
back to an earlier state [8]. 

The basic strategy of Virtual Time is quite close to that followed by our simulation 
framework presented earlier. Both propose the use of state restoration as a mechanism 
for the synchronization of parallel processes. The principal difference is that Virtual 
Time is proposed as a general model for all forms of distributed processing. We are 
only using the roll back synchronization in a very limited, very well characterized 
domain. This has several implications. First, we take advantage of knowledge about 
the context to strictly limit the amount of state information we must keep. The Virtual
Time model requires saving the entire state of the process, including the stack and all
non-local variables, at every checkpoint. Second, we have organized the problem such
that the amount of interprocessor communication is quite small. This in turn leads 
to relatively infrequent roll backs. Third, we are able to make assumptions about the 
distribution of the communication to reduce the frequency of checkpointing. It is not 
clear how frequently the state must be saved in the Virtual Time system. Fourth, by 
subdividing the simulation time step and carefully choosing the checkpoint strategy, 
we are able to guarantee the convergence of the simulation. The general convergence 
properties of Virtual Time are less well characterized. By taking advantage of the 
structure of the simulation algorithm, the history maintenance and roll back approach
to synchronization becomes much more tractable.

Chapter III


Implementation 


It is all very well to theorize about parallel processing, but the best way to assess
the efficacy of a new idea is to try it. A simulator based upon the parallel framework
presented in Chapter Two was designed and built with the following goals:


• to determine whether the roll back approach to interprocessor
  synchronization can be made cost effective in the context of circuit simulation;

• to produce a fast, scalable circuit simulator capable of simulating the
  next generation of VLSI circuits efficiently.


This chapter discusses the details of the implementation of that simulator. 


3.1. Foundations 


Parallel RSIM, or PRSIM, is a distributed circuit simulator which employs the history
and roll back mechanisms discussed in Chapter Two. As the name implies, PRSIM
is based upon the RSIM algorithm of [19]. It is implemented on the Concert
multiprocessor, developed at MIT [2][7].


3.1.1 The RSIM Circuit Simulator


RSIM is an event-driven, logic level simulator that incorporates a simple linear 
model of MOS transistors. In RSIM, MOS transistors are modeled as voltage controlled 
switches in series with fixed resistances, while transistor gates and interconnect are 
modeled as fixed capacitances. Standard RC network techniques are used to predict
not only the final logic state of each node, but also its transition time. This relatively
simple and efficient model provides the designer with information about the relative
timing of signal changes in addition to the functional behavior of the circuit, without
paying the enormous computational costs of a full time domain analysis.


The electrical network in RSIM consists of nodes and transistors. Any MOS circuit 
can be naturally decomposed into subnets if one ignores gate connections; the resulting 
subnets each contain one or more nodes which are electrically connected through the 
sources or drains of transistors. The nodes connected to gates of devices in a subnet 
are the inputs of the subnet, and the nodes which are inputs of other subnets are the 
outputs of the subnet. Note that a node can be both an input and output of a single
subnet.


Subnets are the atomic units of the simulation calculation; in general RSIM will 
recalculate the value of each node of a subnet if any input to the subnet changes. If, as 
a result of the recalculation, an output node changes value, an event is scheduled for 
the simulated time when the output is calculated to reach its new value. Processing an 
event entails recomputing node values for subnets that have the changing node as an 
input. 
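
The cycle of scheduling and processing events can be illustrated with a minimal sketch. It is not RSIM's implementation: the RC delay calculation is replaced by a caller-supplied `evaluate` function, subnets are collapsed to single nodes, and refinements such as cancelling superseded events are omitted. All names are hypothetical.

```python
import heapq


def simulate(events, fanout, evaluate, values):
    """Process events in time order until none are pending.

    events   -- list of (time, node, new_value) tuples
    fanout   -- maps a node to the nodes that must be re-evaluated when it changes
    evaluate -- evaluate(node, values) -> (new_value, delay) for an affected node
    values   -- current value of every node, updated in place
    """
    heapq.heapify(events)
    trace = []
    while events:
        time, node, new_value = heapq.heappop(events)
        if values[node] == new_value:
            continue  # no change, nothing to propagate
        values[node] = new_value
        trace.append((time, node, new_value))
        for affected in fanout.get(node, ()):
            result, delay = evaluate(affected, values)
            if result != values[affected]:
                # Schedule the change for the time the node reaches its new value.
                heapq.heappush(events, (time + delay, affected, result))
    return trace


# Usage: a chain of two inverters, a -> b -> c, each with an assumed delay of 2.
source = {"b": "a", "c": "b"}
trace = simulate(
    events=[(0, "a", 1)],
    fanout={"a": ["b"], "b": ["c"]},
    evaluate=lambda node, values: (1 - values[source[node]], 2),
    values={"a": 0, "b": 1, "c": 0},
)
print(trace)
```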

Internally, RSIM maintains a single event list where all unprocessed events are kept 
in order of their scheduled time. When a node changes value, all other nodes which are 
affected by that change are examined. For each affected node that changes value, the 
simulated time of the change is computed and an event is added to the event list in the 
appropriate place. The next event to be processed is then taken from the beginning
of the list, and the cycle repeats itself. A simulation step is considered complete when
the event list is empty, i.e. when no more changes are pending.


3.1.2 The Concert Multiprocessor 


Concert is a multiprocessor test bed designed to facilitate experimentation with 
parallel programs and programming languages. It is organized as a ring of clusters, 
with 4 to 8 Motorola MC68000 processors in each cluster, as shown in Figure 3.1. 
The processors in each cluster communicate via shared memory across a common bus, 
although each processor has a private, high speed path to a block of local memory. 
The clusters communicate via globally accessible memory across the RingBus. Each
processor therefore sees a three level hierarchy of memory:


1. high speed memory accessible over the processor’s private “back door”
   path (this memory is still accessible to other processors in the cluster via
   the shared bus);
2. slower, non-local cluster memory accessible over the shared cluster bus;
3. global memory, accessible only through the RingBus.


All three levels of the hierarchy are mapped into the address space of each processor. 
Therefore, the memory hierarchy can be treated transparently by the user program if 
it is convenient to do so. Note that non-global cluster memory is not accessible from 
the RingBus [2][7]. 

Over time, a large set of subroutine libraries has been developed for the Concert
system. One such library, the Level 0 Message Passing library, implements a reliable
message delivery system on top of the Concert shared memory system. For each processor
there exists a message queue in global memory. To send a message, the L0 system
copies the message body into global memory if it is not already there, and places a
pointer to the top of the message body into the receiving processor’s queue. To receive
messages, the queue is polled on clock interrupts. Messages on the queue are removed
and returned to the user program by a user-supplied interrupt handler. The L0 package
also provides a set of functions for sending and receiving messages.

Figure 3.1. The Concert Multiprocessor


The original RSIM program used floating point arithmetic for the Thevenin and 
RC calculations. Concert has no floating point hardware, so it was felt that rather than 
emulate the floating point arithmetic in software, it would be more efficient to use scaled 
fixed point arithmetic. A 32-bit integer can represent a range of roughly 9 decimal 
orders of magnitude, more than sufficient for the ranges of resistance, capacitance, and 
time found in contemporary MOS simulation. The actual ranges of the units used by
PRSIM follow:

    0.1 Ω < R < 100 MΩ
    10^-6 pF < C < 1000 pF
    0.1 ns < t < 100 ms
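
To see why these ranges strain 32-bit arithmetic, consider the following sketch. The integer units are assumptions chosen to match the smallest values in each range (0.1 Ω, 10^-6 pF, and 0.1 ns); the text does not state the scale factors PRSIM actually uses, and `rc_to_time` is a hypothetical name.

```python
# Assumed integer units (not given in the text):
#   resistance in 0.1 ohm, capacitance in 1e-6 pF (1e-18 F), time in 0.1 ns (1e-10 s).
# One resistance unit times one capacitance unit is 1e-19 s, so dividing the
# product by 1e9 converts it to time units.
RC_SCALE = 10**9

R_MAX = 10**9  # 100 Mohm expressed in units of 0.1 ohm
C_MAX = 10**9  # 1000 pF expressed in units of 1e-6 pF


def rc_to_time(r, c):
    """Scaled product of a resistance and a capacitance, as a 32-bit time."""
    product = r * c
    if product.bit_length() > 63:
        raise OverflowError("intermediate product exceeds 64 bits")
    t = product // RC_SCALE
    if t.bit_length() > 31:
        raise OverflowError("result exceeds 32 bits")
    return t


print(R_MAX.bit_length(), (R_MAX * C_MAX).bit_length())
print(rc_to_time(100_000, 1_000_000))  # 10 kohm * 1 pF, in units of 0.1 ns
```

Each operand fits comfortably in 32 bits, but their product does not, which is why a wider intermediate result is needed before scaling back down.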


To represent the products and quotients of these units without loss of precision, a 
scaled arithmetic package using 64-bit intermediate results was written. The routine 
RCMul(R, C) computes the 64-bit product of a resistance and a capacitance, and then 
divides by a constant scale factor to produce a 32-bit time quantity. The routine 
MulDiv(A, B, C) multiplies any two 32-bit integers, and divides the 64-bit product by 
a third 32-bit integer to yield a 32-bit result. This is useful for the Thevenin resistance 
calculation. Finally, the routine CvtCond(R) converts a resistance to a conductance
(and vice versa) by dividing its argument into a 64-bit constant to yield a scaled 32-bit
result.


3.2. The Organization of PRSIM 


The PRSIM system consists of two phases: a prepass phase and a simulation phase. 
The prepass phase is responsible for partitioning the network to be simulated and 
for compiling the result into an efficient machine readable format. The simulation 
phase itself can be further broken down into a coordinating program and a simulation 
program. In an n node multiprocessor, one processor is dedicated to the user interface and
coordination functions, while the remaining n - 1 processors do the actual simulation
work. This organization is illustrated in Figure 3.2.


Figure 3.2. Structure of PRSIM 


3.2.1 The Prepass Phase 


The operation of PRSIM begins with the circuit to be simulated expressed in the
lisp-like description language NET [20].† In the NET description the user may also
specify the desired partitioning of the circuit. From this high level description, the PRESIM
program, running on a conventional computer, first partitions the circuit into n - 1
pieces based upon the user’s specification and the constraints imposed by the parallel
framework and the RSIM algorithm. Next, the dependencies between the partitions are
determined and the mapping tables used by each partition and by the coordinator are
constructed. Each output node of each partition is given a list of partitions for which
that node is an input. Finally, n binary files are produced, one for each partition and
one for the coordinator.

† At present, PRSIM has no automatic partitioning system. When such a mechanism is available,
PRSIM will also be able to simulate a circuit extracted from a mask level description.
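
The dependency step can be pictured with a small sketch. It assumes the prepass already knows, for each partition, which nodes it drives and which nodes it reads; the function name and the dictionary layout are illustrative and do not reflect the actual PRESIM file format.

```python
def build_fanout(drivers, inputs):
    """Map each shared output node to the partitions that use it as an input.

    drivers -- partition -> set of nodes driven by that partition
    inputs  -- partition -> set of nodes read by that partition
    """
    fanout = {}
    for src, driven in drivers.items():
        for node in sorted(driven):
            readers = sorted(
                p for p, needed in inputs.items() if p != src and node in needed
            )
            if readers:  # purely internal nodes need no entry
                fanout[node] = readers
    return fanout


drivers = {0: {"a", "b"}, 1: {"c"}, 2: {"d"}}
inputs = {0: {"c"}, 1: {"a"}, 2: {"a", "c"}}
print(build_fanout(drivers, inputs))
```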


3.2.2 The Coordinator 


The coordinator attends to the administrative functions of the simulation. These 
tasks include: 
• loading the network files for each of the partitions from the host
  computer;
• running the user interface to the simulator, including getting and setting
  node values;
• starting, stopping, and resynchronizing the simulation.


The coordinator handles all input and output with the host computer. Upon ini- 
tialization it searches out the active processors in the system and reads the coordinator 
file generated by PRESIM from the host to obtain the number of partitions to be simu- 
lated. For each circuit partition it assigns a processor from the active pool and passes 
it the name of the appropriate network database file. Each slave processor is then re- 
sponsible for reading the appropriate file by sending read requests to the host through 
the coordinator. 

PRSIM supports two different user interface languages: a simple line-at-a-time 
command interpreter for simple operations, and a lisp-like language for more elaborate 
control structures [20]. Through either of these interfaces the user may get and set 
node values, examine the network structure, and start or stop the simulation. 

Each node in the circuit is identified by a globally unique identifier, or node ID,
which is assigned during the prepass phase. The coordinator maintains a table of node
entry data structures, one for each node in the circuit. This table can be referenced
in two different ways: indexed by global node ID, for mapping IDs into names for the
user; and hashed on the ASCII name of a node, for mapping the user specified ASCII
names into global node IDs. In addition to this two-way mapping, the node entry
structure also identifies the partition responsible for driving the node and contains a
list of partitions for which this node is an input. This information is used to permit
the user to examine and set node values.


When the user requests the value of a particular node, the ASCII name provided by 
the user is first mapped into the corresponding node ID by the hash table. A message 
requesting the value of the node is sent to the partition responsible for computing 
that value. The partition then looks up the value of the node and sends back a reply 
message. When the user wishes to set the value of a node, the coordinator sends the 
driving partition a message containing the ID of the node, the new value for the node,
and the simulated time of the change. No reply is necessary.
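
A sketch of the coordinator's node table and of the two requests described above follows. The class names, field names, and the opcodes `GET_VALUE` and `SET_VALUE` are hypothetical; the actual PRSIM messages are listed in Appendix A.

```python
from dataclasses import dataclass, field


@dataclass
class NodeEntry:
    node_id: int
    name: str
    driver: int                                  # partition that drives the node
    fanout: list = field(default_factory=list)   # partitions that read the node


class NodeTable:
    def __init__(self, entries):
        self.by_id = {e.node_id: e for e in entries}   # ID -> entry (for printing)
        self.by_name = {e.name: e for e in entries}    # name -> entry (for lookup)

    def get_request(self, name):
        """Destination partition and message for a value query."""
        entry = self.by_name[name]
        return entry.driver, ("GET_VALUE", entry.node_id)

    def set_request(self, name, value, time):
        """Destination partition and message for setting a node; no reply expected."""
        entry = self.by_name[name]
        return entry.driver, ("SET_VALUE", entry.node_id, value, time)


table = NodeTable([NodeEntry(0, "clk", 1, [2, 3]), NodeEntry(1, "carry", 2, [3])])
print(table.get_request("carry"))
print(table.set_request("clk", 1, 100))
```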


To start a simulation step, the coordinator first establishes user supplied input 
conditions by sending input change messages as necessary to the slave processors. When 
all of the input changes have been established, the coordinator starts the simulation by 
sending a STEP message containing the desired termination time to each slave processor. 
When each processor reaches the specified stop time, it sends a SETTLED message back 
to the coordinator and waits. Since a processor may be forced to roll back after it has 
reached the stop time, roll back notifications are sent to the coordinator as well. With 
this information, the coordinator keeps track of the state of the simulation of each 
partition. When it has determined that all of the slave processors have safely reached 
the stop time, the coordinator sends a RESYNC message to each slave to inform it that
its old history is no longer needed and may be reclaimed.
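
The bookkeeping behind this decision can be sketched as follows. This is a simplification under the assumption that SETTLED and roll back notifications are seen in the order they were generated; it ignores messages still in transit, which a real coordinator would have to account for before declaring the step safe. The class name is hypothetical.

```python
class StepTracker:
    """Tracks which slave partitions currently sit at the stop time."""

    def __init__(self, partitions):
        self.settled = {p: False for p in partitions}

    def on_settled(self, partition):
        self.settled[partition] = True
        return self.all_settled()

    def on_rollback(self, partition):
        # A partition that rolls back is no longer at the stop time.
        self.settled[partition] = False

    def all_settled(self):
        """True when a RESYNC could be sent to every slave."""
        return all(self.settled.values())


tracker = StepTracker([1, 2, 3])
tracker.on_settled(1)
tracker.on_settled(2)
tracker.on_rollback(1)            # partition 1 was forced back after settling
print(tracker.on_settled(3))      # partition 1 has not settled again
print(tracker.on_settled(1))
```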


In the current implementation the simulation is resynchronized only at the termination
of each test vector. Since there are some overhead costs associated with starting
and stopping the simulation, the longer the simulation is allowed to run asynchronously,
i.e., the longer the test vector, the less significant the overhead cost will be. However,
since checkpoint histories are only reclaimed at resynchronization time, the amount of
storage devoted to checkpointing becomes the factor which limits the length of the test
vectors. In future implementations, a mechanism for pruning old checkpoints together
with automatic resynchronization initiated by the coordinator could be used to extend
the length of the vectors.


3.2.3 The Simulation Slave 


The simulation slave program is composed of three components: the simulation
loop; the interprocessor communication mechanism; and the history and roll back
synchronization mechanism. The simulation control loop is shown in Figure 3.3. CurTime
is the current simulated time of the partition, and StopTime is the termination time
specified by the coordinator.


while CurTime < StopTime
{   /* phase one: process events queued for CurTime */
    for each event scheduled for CurTime
        process event;
    send queued output changes;
    if time to checkpoint
        checkpoint();
    /* end of phase one */

    /* phase two: process inputs queued for CurTime */
    for each input change scheduled for CurTime
        process input;
    /* end of phase two */

    CurTime = CurTime + 1;
}


Figure 3.3. Simulation Control Loop 


The processing of events proceeds as follows. For each event scheduled for CurTime, 
the event is removed from the list, the specified node change is made, and the effects 
are propagated through the partition. If the node specified in the event is an output, 
the event is added to the output change list. When all events scheduled for CurTime
have been processed, one input change message is constructed for each partition which
is dependent upon one or more of the outputs in the list. Each message contains the 
value of CurTime and the ID and new value of each node in the list which is an input 
to the receiving partition. Once the input change messages have been sent, the output 
change list is cleared, completing the first phase of the simulation. At this point, if 
a sufficient period of time has elapsed since the last checkpoint, a new checkpoint is 
taken (see Section 3.4 for more detail). 
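
The construction of input change messages at the end of the first phase might look like the following sketch. The message layout (a dictionary with `time` and `changes` keys) and the function name are assumptions made for illustration; `fanout` stands for the per-node list of dependent partitions built during the prepass.

```python
def build_input_change_messages(cur_time, output_changes, fanout):
    """Group this time step's output changes into one message per destination.

    output_changes -- list of (node_id, new_value) collected during phase one
    fanout         -- node_id -> list of partitions for which the node is an input
    """
    messages = {}
    for node_id, value in output_changes:
        for dest in fanout.get(node_id, ()):
            msg = messages.setdefault(dest, {"time": cur_time, "changes": []})
            msg["changes"].append((node_id, value))
    return messages


msgs = build_input_change_messages(
    cur_time=40,
    output_changes=[(7, 1), (9, 0)],
    fanout={7: [1, 2], 9: [2]},
)
print(msgs)
```

Each destination receives only the changes it depends upon, so a partition with no interest in this step's outputs receives no message at all.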

The operation of the second phase of the simulation is similar. For each input 
change there is a data structure which contains the ID of the input node, the new 
value, and the simulated time of the change. These structures are kept in the doubly 
linked InputList sorted by simulated time. The NextInput pointer identifies the 
next input change to be processed. For each input change scheduled for CurTime the 
specified node change is made and the effects propagated through the network. After 
each change is processed, the NextInput pointer is advanced. The InputList remains 
intact. 

By subdividing the simulation of a single time step into the two phases shown, and 
by checkpointing at the end of the first phase, any roll back will restore the simulation to 
the beginning of the second phase. Since the elapsed time between an input change and 
any resulting event is non-zero, the simulation will converge in the manner described
in Chapter Two, although it may require several roll back operations.


3.3. Communication 


There are two classes of interprocessor communication in the PRSIM system: ad- 
ministrative communication with the coordinator for such purposes as loading the 
partition data base and answering queries from the user; and interpartition communi- 
cation required for sharing circuit nodes across multiple partitions. Both of these forms 
of communication make use of a low level message management system which itself is 
built upon the reliable message delivery protocol of the Concert Level 0 system. 


Figure 3.4. PRSIM Message Structure

Figure 3.4 shows the structure of a PRSIM message. The whole message consists
of two components: a Level 0 header, which is used by the Concert Level 0 software,
and the PRSIM message itself. The PRSIM message is further composed of a message
header and a body. The header contains two links for the doubly linked active message
list; a request ID for matching replies to synchronous requests; an opcode field which
identifies the type of message; and a size field which determines the length of the message.
The message body contains the data. Message bodies are a multiple
of 16 bytes in length, up to a maximum of 1024 bytes. The body size of a message
is determined when the buffer for the message is first allocated. When a message has
finished its task, its buffer is returned to a free list managed by the sending processor,
from which it may be reallocated later. To avoid searching one free list for a buffer
of a certain length, there are 64 separate free lists, one for each possible message size.
Messages of the same size are returned to the same free list. A complete list of PRSIM
messages appears in Appendix A.
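
The segregated free lists can be sketched as below. The rounding of a requested size up to the next multiple of 16 bytes is an assumption, as are the class and method names; the sketch models only the buffer bookkeeping, not the message headers.

```python
GRANULE = 16       # body sizes are multiples of 16 bytes
MAX_BODY = 1024    # largest body, giving 1024 / 16 = 64 size classes


class BufferPool:
    def __init__(self):
        self.free_lists = [[] for _ in range(MAX_BODY // GRANULE)]
        self.allocated = 0  # buffers created because a free list was empty

    @staticmethod
    def _index(size):
        if not 1 <= size <= MAX_BODY:
            raise ValueError("body size out of range")
        return (size + GRANULE - 1) // GRANULE - 1

    def allocate(self, size):
        """Reuse a free buffer of the right size class, or create a new one."""
        idx = self._index(size)
        if self.free_lists[idx]:
            return self.free_lists[idx].pop()
        self.allocated += 1
        return bytearray((idx + 1) * GRANULE)

    def release(self, buf):
        """Return a buffer to the free list for its size."""
        self.free_lists[len(buf) // GRANULE - 1].append(buf)


pool = BufferPool()
buf = pool.allocate(20)        # rounded up to a 32-byte body
pool.release(buf)
print(len(buf), pool.allocate(32) is buf, pool.allocated)
```

Because the size class is computed directly from the length, neither allocation nor release ever searches a list.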

To send a message, a processor obtains a buffer of the appropriate size from the free 
list, allocating a new one if necessary, and fills in the body. Next, the busy flag in the
Level 0 header is set and the message is added to the active list. Finally, the message
is placed in the receiving processor’s Level 0 queue, and the sending processor returns
to whatever it was doing. At the receiving end, during clock interrupts and when the 
processor is idle, an interrupt handler polls the Level 0 queue for that processor. If 
there are any new messages, they are removed from the Level 0 queue and added to 
an internal message queue, which the program itself polls at convenient intervals. This 
internal message queue serves to isolate the “user level” program (coordinator or slave) 
from the “interrupt level” message handling, and allows the program to synchronize 
message processing with its own internal operation. To process a message, the user 
program removes it from the internal queue and dispatches on the Opcode field to the 
appropriate handler routine. When the handler is finished, it clears the busy flag in 
the message and returns. The sending program periodically searches through its list of 
active messages, reclaiming those that are no longer in use. 
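
The life cycle of a message, from send through reclamation, is modeled in the sketch below. It is a single-threaded model with hypothetical names: the interrupt-level poll is an ordinary method call, and shared memory is replaced by Python object references.

```python
from collections import deque


class Message:
    def __init__(self, opcode, body):
        self.opcode = opcode
        self.body = body
        self.busy = False


class Processor:
    def __init__(self):
        self.level0_queue = deque()    # filled by senders
        self.internal_queue = deque()  # filled by the interrupt-level poll
        self.active = []               # messages this processor has sent
        self.handlers = {}             # opcode -> handler routine

    def send(self, dest, msg):
        msg.busy = True
        self.active.append(msg)
        dest.level0_queue.append(msg)

    def poll_level0(self):
        """Interrupt level: move new messages to the internal queue."""
        while self.level0_queue:
            self.internal_queue.append(self.level0_queue.popleft())

    def process_messages(self):
        """User level: dispatch on the opcode, then clear the busy flag."""
        while self.internal_queue:
            msg = self.internal_queue.popleft()
            self.handlers[msg.opcode](msg)
            msg.busy = False

    def reclaim(self):
        """Sender: collect active messages the receiver has finished with."""
        done = [m for m in self.active if not m.busy]
        self.active = [m for m in self.active if m.busy]
        return done


a, b = Processor(), Processor()
seen = []
b.handlers["PING"] = lambda m: seen.append(m.body)
a.send(b, Message("PING", "x"))
pending = a.reclaim()          # nothing to reclaim yet: the message is still busy
b.poll_level0()
b.process_messages()
finished = a.reclaim()
print(pending, seen, len(finished))
```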

On top of the non-blocking message passing mechanism described above, a simple 
synchronous request/reply scheme was implemented. This feature is used primarily 
for debugging purposes and to answer queries from the user. For example, the slave 
processors use this mechanism to obtain the ASCII name of a node from the coordinator 
when printing debugging information. The RequestID field of the message is used to 
match incoming replies with outstanding requests. All other messages are left in the
queue unprocessed until all pending requests have received replies.


3.4. History Mechanism 


Chapter Two discussed the requirements the history maintenance mechanism must 
meet. These are summarized below. 
• The checkpoint must contain all of the information necessary to
  completely and atomically transform one consistent simulation state to
  another. There must be no period in which inconsistent results may be
  given.

• It must be possible to make forward progress under all possible
  circumstances. This does not imply we must make forward progress after every
  roll back, but eventually the simulation must converge.


In addition to meeting the above constraints, we would like the history mechanism to
be efficient in both time and memory, as these costs represent part of the overhead
associated with parallel execution.


3.4.1 Simulation State Information 


We can take advantage of the nature of the simulation algorithm to minimize the 
amount of state information that must be checkpointed. As shown in Chapter Two, 
this information includes the internal state of the circuit, the state of externally applied 
inputs, and the state of the algorithm itself. The state of the circuit consists of the 
logic state of each node in the network. The history of externally driven node values 
comes for free by maintaining the input list throughout the simulation. The state of 
the simulation algorithm consists of the contents of the event lists and the current 
simulated time. Since checkpointing and roll back occur only at specified places in the 
slave program, no other process state (i.e., the stack) need be saved. 

All of the state information is kept in a data structure known as the checkpoint 
structure. The list of extant checkpoint structures is kept sorted by simulated time. 
The data structure contains a time stamp to identify the simulated time the checkpoint 
was taken, an array of pointers to the saved event lists, and an array of node values.
The procedure for filling the checkpoint structure is described below.

1. Allocate a new checkpoint data structure. Mark it with the current
   simulated time and add it to the end of the checkpoint list.

2. Make a copy of each event in the event wheel and add it to the appropriate
   list in the checkpoint structure’s event array.

3. Visit each node in the network, recording its value in the node array of
   the checkpoint structure.


For each node in the network, the checkpoint procedure must record its state (0,
1, or X) and whether the user has declared it to be an input. Therefore, three bits
of information are needed to completely specify the state of a node. For the sake of
simplicity and performance, two nodes are packed into each byte of the node array
(it would be more storage efficient but slower to store 5 nodes per 16-bit word). The
procedure to checkpoint the state of the network is shown in Figure 3.5.


/* Array is the node array of the checkpoint structure */
/* Index counts nodes; two nodes share byte Index / 2 */
CkptNetwork(Array)
char *Array;
{   int Index := 0;
    for each node in the network, n
    {   /* Even nodes are put in low order nibble */
        if Index is even
        {   Array[Index / 2] := NodeValue(n);
            if n is an input
                Array[Index / 2] := Array[Index / 2] ORed with 0x04;
        }
        /* Odd nodes are put in high order nibble */
        else
        {   Array[Index / 2] := Array[Index / 2] ORed with
                NodeValue(n) shifted left by 4 bits;
            if n is an input
                Array[Index / 2] := Array[Index / 2] ORed with 0x40;
        }
        Index++;
    }
}


Figure 3.5. Checkpointing the Network State 
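
The packing, together with the inverse operation needed when restoring, can be checked with a short round-trip sketch. The numeric encoding of the X state as 2 is an assumption, as are the function names; only the nibble layout and the input flag bit follow the figure.

```python
X = 2              # assumed encoding of the unknown state; 0 and 1 are themselves
INPUT_FLAG = 0x4   # third bit of the nibble marks a user-declared input


def ckpt_network(nodes):
    """Pack (value, is_input) pairs two to a byte, even nodes in the low nibble."""
    array = bytearray((len(nodes) + 1) // 2)
    for index, (value, is_input) in enumerate(nodes):
        nibble = value | (INPUT_FLAG if is_input else 0)
        array[index // 2] |= nibble << (4 * (index % 2))
    return array


def restore_network(array, count):
    """Unpack `count` nodes from a packed node array."""
    nodes = []
    for index in range(count):
        nibble = (array[index // 2] >> (4 * (index % 2))) & 0xF
        nodes.append((nibble & 0x3, bool(nibble & INPUT_FLAG)))
    return nodes


nodes = [(1, False), (X, True), (0, True)]
packed = ckpt_network(nodes)
print(packed.hex(), restore_network(packed, len(nodes)) == nodes)
```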


3.4.2 Checkpoint Strategy 


In Chapter 2 we discussed a strategy to vary the frequency of checkpointing to 
achieve both a high density of checkpoints in the vicinity of communication clusters, 
and a low average overhead when the simulation is well balanced. To this end, we 
define a checkpoint cycle to be the set of checkpoints between any pair of occurrences 
of resynchronization or roll back. 

Figure 3.6. Checkpoint Distribution

Figure 3.6 demonstrates the strategy chosen. The checkpoint cycle begins at time
t_0. The checkpoints are indicated by Xs. If this cycle was initiated by a
resynchronization, a checkpoint is taken at t_0 to guarantee the simulation can be rolled
back to its initial state. If the cycle was initiated by a roll back to t_0, the checkpoint
at t_0 is still valid, so no new checkpoint is taken. In either case, the state is then
checkpointed at each succeeding time step for the next three steps, ensuring forward
progress will be made. At time t_1 the period increases to two steps, at t_2 the period
increases to four steps, and so on. The period increases in this fashion to a maximum
period of 1024 time steps. Both the time constant and the final value of the exponential
were chosen empirically.
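
One possible reading of this schedule is sketched below. The text fixes three checkpoints at a period of one step and a maximum period of 1024 steps; taking three checkpoints at every later period as well is an assumption (the parameter `per_period`), as is the function name. The sketch always lists t_0 itself, whether that checkpoint is new or inherited from before a roll back.

```python
def checkpoint_times(t0, horizon, per_period=3, max_period=1024):
    """Checkpoint times in one cycle starting at t0, up to and including horizon."""
    times = [t0]
    period, t = 1, t0
    while True:
        for _ in range(per_period):
            t += period
            if t > horizon:
                return times
            times.append(t)
        # Double the spacing after each group, up to the maximum period.
        period = min(period * 2, max_period)


print(checkpoint_times(0, 40))
```

Checkpoints are dense immediately after a resynchronization or roll back, where further roll backs are most likely, and become sparse once the simulation has run undisturbed for a while.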


3.5. Roll Back Mechanism 


The queue of incoming messages is examined at the end of the first phase of the
simulation loop. If there are any input change messages pending, they are removed from
the queue and processed. For each entry in each message, an input change structure
is inserted into the input list at a place specified by the simulated time contained in
the message. Let t_0 be the simulated time specified in the earliest pending message. If
CurTime < t_0, no further action is taken. If CurTime > t_0, the processor must stop the
simulation and roll back. To roll back, the processor walks back through the checkpoint
list to find the latest checkpoint taken at a time t_c < t_0. Each node of the partition
is visited and its value restored from the node array of the checkpoint structure. All
events currently on the event lists are thrown away, and the event lists in the checkpoint
structure are copied into their places. The NextInput pointer is moved back through
the input change list to point to the next change at time t_i > t_c. A roll back notification
message is sent to the coordinator and to all other partitions dependent upon this one.
Finally, all checkpoints taken after t_c are reclaimed for later use (added to a free list).


Details of the roll back operation are shown in Figure 3.7. The RestoreNetwork routine
is similar to the CkptNetwork routine discussed earlier.

/* Roll the simulation back to a time before t and restore
 * the state from event checkpoint and node history lists
 */
RollBack(t)
int t;
{   struct checkpoint *ctmp;

    /* find closest checkpoint to roll back time t */
    ctmp := last element of CkptList;
    while time of ctmp > t
        ctmp := previous element of CkptList;
    CurTime := time of ctmp;

    /* walk the network restoring node values */
    RestoreNetwork(ctmp);

    /* restore event array and overflow list */
    RestoreEvents(ctmp);

    /* back up next input pointer */
    while scheduled time of NextInput > CurTime
        NextInput := previous element of InputList;

    /* Roll back notification to anyone who cares */
    for each partition in dependent list
        send roll back notification;

    /* garbage collect old checkpoints */
    for each checkpoint in CkptList later than CurTime
    {   remove from CkptList;
        place on FreeCkptList;
    }
}

Figure 3.7. Roll Back Procedure
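The two list operations in Figure 3.7, finding the checkpoint to restore and reclaiming the newer ones, can be sketched in C as below. This is an illustrative sketch only: the types and the names ckpt_append, ckpt_find, and ckpt_reclaim_after are hypothetical, and the real checkpoint structure also carries node values and event lists. The comparison follows the loop in Figure 3.7, so a checkpoint taken exactly at t is accepted; whether that is the intended boundary is an assumption.

```c
#include <stddef.h>

struct ckpt {
    struct ckpt *prev, *next;
    long ctime;                /* simulated time at which the checkpoint was taken */
};

struct ckpt_list {
    struct ckpt *head, *tail;  /* oldest and newest checkpoints */
};

/* Append c as the newest element of l. */
void ckpt_append(struct ckpt_list *l, struct ckpt *c)
{
    c->next = NULL;
    c->prev = l->tail;
    if (l->tail != NULL)
        l->tail->next = c;
    else
        l->head = c;
    l->tail = c;
}

/* Walk backward from the newest checkpoint to the latest one taken at or
 * before t. Returns NULL if there is none. */
struct ckpt *ckpt_find(const struct ckpt_list *l, long t)
{
    struct ckpt *c = l->tail;
    while (c != NULL && c->ctime > t)
        c = c->prev;
    return c;
}

/* Unlink every checkpoint newer than keep and push it on *free_list.
 * Returns the number of checkpoints reclaimed. */
int ckpt_reclaim_after(struct ckpt_list *l, struct ckpt *keep,
                       struct ckpt **free_list)
{
    int n = 0;
    while (l->tail != NULL && l->tail != keep) {
        struct ckpt *c = l->tail;
        l->tail = c->prev;
        if (l->tail != NULL)
            l->tail->next = NULL;
        else
            l->head = NULL;
        c->prev = NULL;
        c->next = *free_list;
        *free_list = c;
        n++;
    }
    return n;
}
```

A roll back to time t would call ckpt_find, restore the state from the result, and then call ckpt_reclaim_after with the same checkpoint.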

When processor P_i receives notification that processor P_j rolled back to time t_0,
P_i must clean up its act to reflect the new knowledge about the state of P_j. If P_i has
no record of input changes from P_j which are dated more recently than t_0, nothing
need be done. If P_i has changes from P_j more recent than t_0, those changes are spliced
out of the input list. If P_i has not processed any of those changes (i.e. the earliest
change is scheduled for a time > CurTime_i), no further action is taken. If, however,
P_i has already processed at least one of the changes, the results of those changes must
be undone. P_i must therefore roll back to a time preceding the earliest of the invalid
changes. Note that P_i need not be rolled all the way back to t_0, but only far enough
to undo the effects of false changes from P_j. Any new changes from P_j will explicitly
force P_i to roll back. This response is shown in more detail in Figure 3.8. The history
and roll back mechanisms are presented in Appendix B.


/* Respond to Roll Back Notification from processor P at time t */
HandleNotify(P, t)
{   int earliest;
    struct Input *in;

    /* nothing to undo unless an input from P is found below */
    earliest := infinity;

    /* walk backward from end of InputList to remove inputs from P */
    in := last element of InputList;
    while scheduled time of in > t do
    {   if in came from processor P
        {   earliest := scheduled time of in;
            remove in from InputList;
        }
        in := previous element of InputList;
    }

    /* Roll back to earliest, if necessary */
    if (CurTime > earliest)
        RollBack(earliest);
}

Figure 3.8. Response to Roll Back Notification
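The splice step of Figure 3.8 can be sketched in C as follows. The types and the names input_append, splice_inputs, and must_roll_back are hypothetical, and the sketch makes two assumptions that the figure leaves implicit: the input list is kept sorted by scheduled time, and "no input removed" is represented by LONG_MAX so that the later comparison with CurTime is well defined.

```c
#include <limits.h>
#include <stddef.h>

struct input {
    struct input *prev, *next;
    long time;                   /* scheduled simulated time of the change */
    int  src;                    /* processor that sent the change */
};

struct input_list {
    struct input *head, *tail;   /* ASSUMPTION: sorted by increasing time */
};

/* Append in as the last (latest) element of l. */
void input_append(struct input_list *l, struct input *in)
{
    in->next = NULL;
    in->prev = l->tail;
    if (l->tail != NULL)
        l->tail->next = in;
    else
        l->head = in;
    l->tail = in;
}

/* Remove every input from processor p scheduled later than t.
 * Returns the earliest removed time, or LONG_MAX if nothing was removed. */
long splice_inputs(struct input_list *l, int p, long t)
{
    long earliest = LONG_MAX;
    struct input *in = l->tail;

    while (in != NULL && in->time > t) {
        struct input *prev = in->prev;   /* save the link before unlinking */
        if (in->src == p) {
            earliest = in->time;         /* backward walk: last hit is the earliest */
            if (in->prev != NULL) in->prev->next = in->next; else l->head = in->next;
            if (in->next != NULL) in->next->prev = in->prev; else l->tail = in->prev;
        }
        in = prev;
    }
    return earliest;
}

/* A roll back is needed only if a removed change has already been processed. */
int must_roll_back(long cur_time, long earliest)
{
    return cur_time > earliest;
}
```

The caller would invoke its roll back routine with the returned time only when must_roll_back reports 1.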


3.6. Summary 


PRSIM is a logic level simulator based upon the RSIM algorithm which takes ad- 
vantage of the locality of circuit operation to achieve parallelism. Interprocessor syn- 
chronization is accomplished through the history maintenance and roll back technique 
presented in Chapter Two. PRSIM makes few demands upon the underlying parallel 
architecture. It requires a reliable, order preserving message delivery substrate for 
communication. There is no need for shared memory, or special hardware for float- 
ing point arithmetic or memory management. The current implementation of PRSIM 
has no automatic partitioning mechanism. The designer must specify the partitioning
before simulation.
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The kernel of the original RSIM program (excluding user interface) consists of 
approximately 1430 lines of C code. The simulation slave portion of PRSIM, including 
message handling, contains approximately 2800 lines of C code, or roughly double the 
original size. Of the 2800 lines, approximately 450 lines are dedicated to the history 
maintenance and roll back features, while message handling, file I/O, and debugging 
account for the rest. There are about 800 lines of code dedicated to the coordinator’s
administrative functions (excluding user interface), split roughly evenly between file
I/O and message management.




Chapter IV 


Results 


A preliminary set of experiments was designed and run to determine the performance
of the PRSIM implementation. The first set of experiments was designed to
measure the overall performance of PRSIM, with special emphasis on the scaling behavior.
To completely understand the results of these experiments, extensive performance
monitoring facilities were added, and a second set of experiments was run. This chapter
presents and discusses the results from those two sets of experiments.

4.1. Overall Performance 


4.1.1 Experimental Procedure 


To determine the scaling behavior of PRSIM, a set of identical simulations was
run with a varying number of processors. The set of simulations is composed of one
test circuit and a large number of randomly generated test vectors. The experiments
consisted of simulating all of the vectors on each of a number of partitionings of the
test circuit.


The number of essential events for a given circuit and set of test vectors is defined
to be the number of events processed in a uniprocessor simulation. This set of events is
the standard by which multiprocessor simulations are judged. Therefore, the number
of essential events processed per second of run time is a measure of the useful
(non-redundant) work performed. This is the metric by which the overall performance of
the parallel simulator is measured. To obtain these values, it is necessary to count
the number of events processed in the one partition experiment, and the amount of
time elapsed during the simulation of each vector in each experiment. Elapsed time is
measured in units of clock ticks, where one clock tick ≈ 16.2 mSec.

The scaling behavior is most easily expressed in terms of the effective speedup
factor obtained from a given number of processors. The speedup factor for N processors
is defined to be:

    \[ \mathrm{Speedup} = \frac{t(1)}{t(N)} \]

where t(N) is the time taken to run a given experiment on N processors. The extra
simulation incurred as a result of roll back can be expressed in terms of the simulation
efficiency, which is defined to be:

    \[ \eta_e = \frac{\text{No. of events}(1)}{\text{No. of events}(N)} \]

where No. of events(N) is the number of events processed in an N partition experiment.
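The three quantities defined above (essential events per second, speedup, and simulation efficiency) can be computed from the raw counts as sketched below. The function names are hypothetical and are not taken from PRSIM. The only constant from the text is the clock tick length of about 16.2 mSec.

```c
#define TICK_SECONDS 0.0162   /* one clock tick is about 16.2 mSec (from the text) */

/* Useful work rate: essential events (counted in the one partition run)
 * divided by the elapsed time of the run being measured, in seconds. */
double events_per_second(long essential_events, long elapsed_ticks)
{
    return (double)essential_events / ((double)elapsed_ticks * TICK_SECONDS);
}

/* Speedup for N processors: t(1) / t(N), both measured in clock ticks. */
double speedup(long ticks_1, long ticks_n)
{
    return (double)ticks_1 / (double)ticks_n;
}

/* Simulation efficiency: events processed with one partition divided by
 * events processed with N partitions (extra events come from roll back). */
double efficiency(long events_1, long events_n)
{
    return (double)events_1 / (double)events_n;
}
```

Because the tick length cancels in the speedup ratio, only events_per_second depends on the 16.2 mSec figure.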

The test circuit is a 64-bit adder, using a dynamic logic CMOS technology. The 
adder uses a look ahead carry mechanism for groups of four bits, as shown in Figure 4.1. 
The dynamic logic is clocked by a two phase clock, supplied externally. The carry out 
signal from each group is rippled into the next most significant group of the adder. 
Because the dynamic logic in the carry look ahead block is highly connected, the adder 
will be partitioned along the four bit group boundaries. The only communication 
between the partitions consists of the carry chain. The adder contains a total of 2688
transistors and 1540 nodes. There are 1328 N-type transistors and 1360 P-type transistors.
Each 4-bit slice contains 168 transistors and 96 nodes.
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Figure 4.1. 4-Bit Slice of Adder Circuit 


Experiments were run with the test circuit partitioned into 1, 2, 3, 4, and 6 
partitions. The organization of each partition is shown in Figure 4.2. The marks 
across the top indicate groups of four bits. In each experiment, all of the partitions are 
of equal length except the six partition case, where the first two partitions contain 8 
bits each, while the rest contain 12 bits. Random test vectors of varying length were
used. The lengths ranged from 2 to 24 clock cycles, with four sets of vectors in each
length.


4.1.2 Results 


A summary of the raw performance data is shown in Table 4.3. The complete
results are presented in Appendix C. Table 4.3 shows the average performance of PRSIM
in essential events per second as a function of both the length of the test vector and
the number of processors. There are a number of discrepancies from what might be
considered ideal behavior. The first is the decline in raw performance of the one
partition experiment as the length of the test vector increases. This is attributed
to the cost of reclaiming the checkpoint data structures upon resynchronization. Since
each checkpoint contains an arbitrary number of events, it is necessary to walk the length
of the checkpoint list when reclaiming, incurring a cost proportional to the length of
the list.

Figure 4.2. Adder Partitioning




Table 4.4. Simulation Efficiency and Speedup Factor 


Table 4.4 shows the average simulation efficiency and the speedup factor as a
function of the number of processors. Table 4.4 demonstrates that the simulation
efficiency is relatively unaffected by the number of partitions. This indicates both a high
degree of decoupling between the partitions, with a correspondingly low occurrence of roll
back, and an even balance in simulation load, which is consistent with the partitioning
chosen.


Figure 4.5. Speedup Factor Versus Number of Partitions
(Plot of Speedup Factor, vertical axis, against Number of Partitions, 1 through 6, horizontal axis.)


The speedup factor results are somewhat more interesting. Figure 4.5 presents a
plot of the speedup as a function of the number of partitions. With six processors,
a speedup of only 3.77 is achieved. The performance increases less
than linearly with the number of processors. Clearly, the small decrease in simulation
efficiency is not the dominant factor. To understand this phenomenon, more detailed
information is required.


4.2. Profiling Results 


To understand the performance behavior of PRSIM, it is necessary to build a detailed
model of the costs associated with the various functions. In particular, we need
to know the following information:

1. Impact of the partitioning - How well balanced is the simulation load?
   How much interprocessor communication is there?

2. Synchronization costs - How much time is spent maintaining the checkpoint
   lists? How expensive is the roll back operation?

3. Communication costs - How expensive is message handling? How much
   of that cost is associated with the low level implementation?

To obtain this information, a statistical profiling scheme similar to that of Version 7
UNIX† was implemented for the simulation slave program.


4.2.1 Experimental Procedure 


The profiling scheme collects the number of calls to every subroutine in each simulation
slave and the total amount of time spent in each subroutine. This information is
sufficient to determine the percentage of the total time that is spent in each subroutine,
and the average length of time spent per subroutine call.

When the program starts up, a table to contain the subroutine call information is 
built. Each line of the table contains a pointer to the entry point of one subroutine, and 
the count of the number of times that routine has been called. Each subroutine contains 
a pointer to the corresponding table entry. The compiler automatically inserts code 
at the beginning of every subroutine to manipulate the count table. When the routine 
is first called, it is linked into the table. On each succeeding call, the corresponding 
counter in the table is incremented. When the program exits, the table is written into 
a file to be interpreted later. 
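The call counting mechanism can be sketched in C as below. This is an illustration and not the compiler-generated code: count_call stands in for the code inserted at the start of each routine, the names and the fixed table size are hypothetical, and the sketch counts the first call as well as the succeeding ones, which is an assumption about the original behavior.

```c
#include <stddef.h>

#define MAX_ROUTINES 256          /* ASSUMPTION: fixed table size for this sketch */

typedef void (*entry_fn)(void);

struct call_entry {
    entry_fn      entry;          /* entry point of the routine */
    unsigned long count;          /* number of calls seen so far */
};

struct call_entry call_table[MAX_ROUTINES];
size_t call_table_used = 0;

/* Stand-in for the code inserted at the beginning of every routine.
 * *slot is the routine's private pointer to its table entry (NULL until linked). */
void count_call(struct call_entry **slot, entry_fn entry)
{
    if (*slot == NULL) {
        if (call_table_used == MAX_ROUTINES)
            return;                              /* table full: call goes uncounted */
        *slot = &call_table[call_table_used++];  /* first call: link into the table */
        (*slot)->entry = entry;
        (*slot)->count = 0;
    }
    (*slot)->count++;
}

/* Example of a routine instrumented by hand. */
void sample_routine(void)
{
    static struct call_entry *slot = NULL;
    count_call(&slot, sample_routine);
    /* ... body of the routine ... */
}
```

At exit, writing out the first call_table_used entries gives one (entry point, count) pair per routine that was called at least once.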

A statistical averaging technique is used to determine the amount of time spent
in each subroutine of the program. A table of program counter ranges is maintained
in which each entry represents the number of times the sampled program counter lay
within a given 8 byte range. At every clock interrupt (once every 16 mSec.), the program
counter is sampled, the value shifted right 3 bits, and used as an index into the array.
The indexed table entry is then incremented. When the program exits, the table is
written into a file to be interpreted later. By taking a sufficiently large number of
samples, we can obtain a fairly accurate profile of the amount of time spent in each
subroutine.

† UNIX is a trademark of Bell Laboratories.
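The program counter sampling scheme can be sketched in C as follows. The structure and function names are hypothetical. The 3 bit shift and the 8 byte ranges come from the text; subtracting a base address before indexing, and the bounds checks, are assumptions added so the sketch is safe to run.

```c
#include <stddef.h>
#include <stdint.h>

#define BUCKET_SHIFT 3   /* each table entry covers an 8 byte range (from the text) */

struct pc_profile {
    uintptr_t      base;      /* ASSUMPTION: lowest program address covered */
    size_t         nbuckets;  /* number of entries in hits[] */
    unsigned long *hits;      /* sample counts, one per 8 byte range */
};

/* Called at each clock interrupt with the interrupted program counter. */
void pc_sample(struct pc_profile *p, uintptr_t pc)
{
    size_t idx;

    if (pc < p->base)
        return;                                   /* below the profiled region */
    idx = (size_t)((pc - p->base) >> BUCKET_SHIFT);
    if (idx < p->nbuckets)
        p->hits[idx]++;
}

/* Total samples that fell in the address range [lo, hi).
 * lo and hi are assumed to be 8 byte aligned and not below base. */
unsigned long pc_hits(const struct pc_profile *p, uintptr_t lo, uintptr_t hi)
{
    unsigned long total = 0;
    size_t i = (size_t)((lo - p->base) >> BUCKET_SHIFT);
    size_t end = (size_t)((hi - p->base) >> BUCKET_SHIFT);

    if (end > p->nbuckets)
        end = p->nbuckets;
    for (; i < end; i++)
        total += p->hits[i];
    return total;
}
```

Summing the counts over the address range of one subroutine and multiplying by the sampling interval gives an estimate of the time spent in that subroutine.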

Profiling data was gathered for the six partition experiment described above. Five 
sets of test vectors, each of length 16, were run. To provide a sufficiently large sample, 
each vector was simulated ten times. Therefore, the sample consists of 800 simulated 
clock cycles of 200nSec. each, or 160uSec. of simulated time. Each vector generates 
roughly 18,000 essential events, for a total of approximately 850,000 events in the 
sample. 

Profiling is enabled by the coordinator immediately before the input vectors are
established, and disabled immediately after each resynchronization. Therefore, the
profiling data does not include time spent in the user interface.


4.2.2 Results 


The complete results of the profiling experiment appear in Appendix D. Table 4.6 
summarizes the percentage of idle time recorded by each processor (time spent in the 
routine step). The idle time is the sum of the time elapsed between reaching the 
specified stop time and the subsequent resynchronization or roll back. The high idle 
times of partitions #1 and #2 are the result of the relative partition sizes: partitions 
#1 and #2 contain 8 bits each, while the rest contain 12 bits. The decrease in the idle 
times from partition #3 to #6 follows the communication through the carry chain: the 
further down the chain, the longer it takes to settle. 

The speedup results reported earlier can now be explained. The expected speedup
for N processors can be expressed as:

    \[ \mathrm{Speedup} = N \, \eta_e \, \eta_p \]

where \eta_e is the simulation efficiency defined in Section 4.1.1 and \eta_p is the
processor utilization factor. For the six partition experiment, we obtain
an expected speedup of 4.03. The non-linearity of the curve can be explained by \eta_p
decreasing as the number of partitions is increased.
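As a check on the arithmetic, the reported figures can be substituted into this expression, writing \eta_e for the simulation efficiency and \eta_p for the processor utilization. The product of the two factors is inferred here from the reported expected speedup, not read from the tables.

```latex
\[
\eta_e \, \eta_p = \frac{\mathrm{Speedup}_{\text{expected}}}{N}
                 = \frac{4.03}{6} \approx 0.67,
\qquad
\frac{\mathrm{Speedup}_{\text{measured}}}{\mathrm{Speedup}_{\text{expected}}}
                 = \frac{3.77}{4.03} \approx 0.94 .
\]
```

On these numbers, the utilization model accounts for all but roughly 6% of the measured six processor speedup.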
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Table 4.6. Idle Time per Partition 


Table 4.7 shows a breakdown of where the active (non-idle) time was spent by
each partition. The figures are percentages of the total active time of each partition.
The data is divided into three categories of activity as follows:

Simulation: the time spent in the RSIM simulation algorithm itself. This is
subdivided as follows:

    Arithmetic: the time spent in the scaled fixed point arithmetic routines.
    Other: all other aspects of the RSIM algorithm.

History: the time which may be attributed to the roll back synchronization scheme.
This is subdivided as follows:

    Checkpoint: the time spent creating and maintaining the state checkpoints.
    Roll Back: the time spent restoring the state upon roll back.

Communication: the time associated with interprocessor communication. This is
subdivided as follows:

    System Level: the time spent polling and manipulating the interrupt level
    message queues.
    User Level: the time spent constructing and handling messages at the user
    level.

Table 4.7 shows the amount of time spent in overhead is relatively small; nearly
90% of the active time is spent in the simulation algorithm, most of that in the fixed
point routines. The overhead time is dominated by the communication, and not by the
history mechanism. Only in partition #2, which had a relatively high incidence of roll
back, is the checkpointing overhead non-negligible.

Table 4.7. Breakdown of Time Spent by Function
(Columns: Partition Number. Rows: Simulation [Arithmetic, Other, Total]; History
[Checkpoint, Roll Back, Total]; Communication [System Level, User Level, Total].)


4.3. Discussion 


There are two important conclusions that can be reached from the results reported 
in this chapter. First, the circuit partitioning has a significant impact on the scaling 
performance of the simulator. The dominant effect, at least in the small test case 
reported here, is not the overhead associated with communication or synchronization, 
but is the dynamic load balance. Even though the test circuit was statically well 
partitioned, the dynamic behavior resulted in only about 70% processor utilization 
with six partitions. Decreasing processor utilization resulted in “diminishing returns” 
in the speedup factor, as shown in Figure 4.5. 

The second conclusion is that the results reported are inconclusive. Because the 
active time was so completely dominated by the simulation load, it is difficult to build 
any detailed models of the overhead costs associated with the history and roll back 
mechanisms. The test circuit was too small and too regular to exhibit much interesting 
behavior. Somewhat better results could perhaps have been achieved by running much 
longer test vectors. Unfortunately, the current implementation of PRSIM is severely
memory bound. If automatic resynchronization were employed to limit the storage
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Chapter V 


Conclusion 


5.1. Summary 


Integrated circuit technology has been advancing at a phenomenal rate over the 
last several years, and promises to continue to do so for the foreseeable future. If 
circuit design is to keep pace with fabrication technology, radically new approaches 
to computer-aided design will be necessary. This thesis has explored the problems 
of capacity limitation in existing simulation tools, and has sought to develop a new 
approach to building fast, scalable circuit simulators. 

We began by examining the locality inherent in digital circuit operation. Digital 
circuit elements operate on local information, producing local results. It was observed 
that there exists a class of simulation algorithms which exhibit a similar locality prop- 
erty. Therefore, we set out to develop a framework for circuit simulation which could 
take advantage of this locality to achieve a high degree of parallelism. The scheme we
developed involved mapping the circuit to be simulated onto the topology of the target
multiprocessor to take advantage of the natural structure of the circuit. We explored
the problems associated with the precedence constraints imposed by the partitioning.
We discovered that many of these problems could be avoided by inserting buffers
between the partitions, effectively decoupling them. This led to the development of a
synchronization mechanism based upon history maintenance and roll back. By
periodically saving the state of the simulation, each partition can be allowed to simulate
asynchronously with respect to the others, rolling back the simulation if necessary to
correct for input changes. This solution was demonstrated to be sufficient to
guarantee convergence in the presence of feedback. We then discussed the importance of
the strategy used to partition the circuit, and argued that static graph partitioning
techniques may not be adequate. Finally, we quickly reviewed some related research,
with emphasis on the relationship to the work reported in this thesis.


To determine the merit of the ideas presented in Chapter Two, a circuit simu- 
lator, PRSIM, was designed and built. Chapter Three discussed the details of that 
implementation. The chapter began with an overview of the RSIM simulator and the 
Concert multiprocessor on which PRSIM is based. RSIM was chosen as the vehicle for 
this implementation because it is an event driven simulator which exhibits the locality 
properties discussed in Chapter Two. PRSIM is organized into three components: the 
prepass phase, which is responsible for partitioning the circuit; the coordinator, which 
is responsible for attending to the administrative functions, such as file I/O and in- 
terfacing to the user; and the simulation slave, which performs the actual work of the 
RSIM algorithm. We discussed the organization of the simulation control loop, which 
is decomposed into two phases: an event processing phase, and an input processing 
phase. This two phase organization, together with the variable checkpointing strategy, 
is sufficient to guarantee convergence according to the argument presented in Chapter 
Two. All interprocessor communication is implemented by a simple, non-blocking mes- 
sage passing mechanism, built on top of the Concert Level 0 message passing protocol. 
Some optimizations were made in light of the fact that Concert is a tightly coupled
multiprocessor system, but the essential mechanism does not rely upon shared memory.
Finally, the history maintenance and roll back algorithms were presented in detail.

A preliminary set of experiments was run to determine the scaling behavior of
PRSIM. The experiments were organized into two sets. The first set was designed
to measure the overall performance of PRSIM, while the second set was designed to
obtain detailed information about the internal behavior of PRSIM. From the first set, we
learned that the performance increased by nearly a factor of 2 in going from one to two
partitions, but that beyond two there was a “diminishing returns” phenomenon. From
the profiling experiments of the second set, we discovered that the cause of this behavior
was that the processor utilization decreased as the circuit was partitioned into finer and
finer pieces. The profiling experiments also revealed that less than 12% of the active
processing time was spent in overhead associated with parallel execution. Although
this result was somewhat encouraging, it made it nearly impossible to develop models
of the overhead costs and scaling behavior. The conclusion derived from these results
is that the partitioning strategy is very important, and requires further research.


5.2. Directions for Future Research 


The results reported in this thesis suggest several avenues for future research. One 
of the most serious problems encountered was the bound on the length of the simulation 
which resulted from the memory requirements of checkpointing. This suggests the need 
for automatic resynchronization: reclaiming old state once it can be guaranteed that no 
partition can be forced to roll back beyond a certain point. This will require additional 
communication to coordinate the checkpointing, but is probably cost effective in the 
long run. 

The checkpointing strategy that was implemented was based upon empirical results 
with arbitrarily chosen parameters. One direction for future work is to develop formal 
statistical models for the communication behavior of digital circuits. These models 
could then be used to optimize the checkpointing strategy for a particular circuit, either
statically at partition time, or dynamically based upon the communication patterns
observed.




Perhaps the most important issue raised is the problem of effective network parti- 
tioning. It would be interesting to explore the limits of static partitioning algorithms. 
Ultimately, however, it will probably be necessary to turn to some form of dynamic 
partitioning. Two quantities determine the performance of a given partitioning: the 
amount of useful simulation work accomplished by each partition, and the amount of 
communication among the partitions. A dynamic partitioning strategy should try to 
balance the first quantity, while minimizing the second. We can view the level of activity 
in a partition as a “temperature”. As the activity (simulation work and communica- 
tion) increases, the temperature rises. The goal of dynamic partitioning is to achieve a 
low, uniform temperature across the multiprocessor. Periodically, the temperature of 
each partition should be sampled, and atomic units from hotter partitions moved into 
adjacent, cooler partitions, following the temperature gradient. If the fluctuations in
temperature have a very short time constant (on the order of a single clock cycle), it
may only be necessary to repartition once or twice near the beginning of a simulation.
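One step of the temperature-driven repartitioning proposed above could look like the following C sketch. Everything here is hypothetical, since the text gives only the idea: the partitions are assumed to form a linear chain (as in the adder experiments), one atomic unit is moved per boundary per step, and a threshold is assumed so that small temperature differences do not cause movement. Temperatures would be resampled before the next step.

```c
struct partition {
    int    units;        /* atomic units currently assigned to the partition */
    double temperature;  /* sampled activity (simulation work and communication) */
};

/* One pass over a linear chain of n partitions. Across each boundary where the
 * temperature difference exceeds threshold, move one unit from the hotter
 * partition to the cooler one. A partition is never emptied.
 * Returns the number of units moved. */
int rebalance_step(struct partition *p, int n, double threshold)
{
    int moved = 0;

    for (int i = 0; i + 1 < n; i++) {
        double gap = p[i].temperature - p[i + 1].temperature;

        if (gap > threshold && p[i].units > 1) {
            p[i].units--;            /* left side is hotter: shed a unit rightward */
            p[i + 1].units++;
            moved++;
        } else if (-gap > threshold && p[i + 1].units > 1) {
            p[i + 1].units--;        /* right side is hotter: shed a unit leftward */
            p[i].units++;
            moved++;
        }
    }
    return moved;
}
```

The sketch leaves open which unit to move and how the move affects communication, which are the questions the text identifies as needing further research.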


The framework that we have described does not rely upon the memory architecture 
of any particular multiprocessor. It is intriguing to consider the possibility of a simula- 
tion spread among a loosely coupled collection of machines. For example, it should be 
possible to build a simulator which locates idle machines on a local area network, and 
dispatches pieces of the simulation load to them. To determine the viability of this idea, 
we need a better understanding of the sensitivity of our approach to message latency. 
A series of experiments can be performed with the current PRSIM implementation in
which the message delivery latency is varied by the sending processor.


A great deal of the active run time of PRSIM was spent in the fixed point arithmetic 
package. Although not directly related to the field of parallel simulation, this problem 
suggested the construction of an assigned delay simulator. The prepass phase of such a 
simulator would construct a table of transition delays for each node in the circuit using 
the RSIM (or any other) model. Having thus precomputed the delays for every node in
the circuit, at run time the simulator need only perform a table look up to schedule an
event.
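The run time side of such an assigned delay simulator reduces to an array lookup, as the C sketch below suggests. The names are hypothetical, and the table layout (one rising and one falling delay per node) is an assumption; the text says only that the prepass builds a table of transition delays for each node.

```c
enum transition { RISE = 0, FALL = 1 };

struct delay_table {
    int         nnodes;  /* number of nodes in the circuit */
    const long *delay;   /* ASSUMED layout: delay[2 * node + transition], in time steps */
};

/* Time at which a transition of the given node, triggered at cur_time,
 * should be scheduled. No delay arithmetic is done at run time. */
long scheduled_time(const struct delay_table *tbl, int node,
                    enum transition tr, long cur_time)
{
    return cur_time + tbl->delay[2 * node + tr];
}
```

The fixed point arithmetic that dominates the active time in the profiling results would then be confined to the prepass.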


5.3. Conclusion 


We have presented an approach to parallel simulation which is based upon the 
inherent parallelism of circuit operation. The initial implementation of PRSIM demon- 
strates that history maintenance and roll back is a viable solution to interprocessor 
synchronization in this context. Much work remains to be done, however, to determine
whether this approach can indeed be scaled to an arbitrary number of processors.
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Appendix A 


PRSIM Messages 


The following is a list of the message types used by PRSIM. The entry for each
message contains the name of the message, the purpose of the message, and the information
contained in the body. The messages are divided into three groups: coordination of
the simulation; file I/O with the host computer; and support for the user interface.

The following group of messages supports the coordination of the simulation activity.

LOAD-NETWORK
The LOAD-NETWORK message is sent from the coordinator to each simulation
slave upon initialization. This message contains the number of partitions in
the simulation, the partition ID for the receiving processor, the table to map
partition numbers to processor numbers, and the name of the partition file
on the host computer.

LOAD-NETWORK REPLY
The LOAD-NETWORK REPLY message is sent by each slave to the coordinator
upon the completion of the network initialization. The body is empty.

SETNODE
The SETNODE message is sent to a slave to inform it of external node changes.
The body contains the simulated time the change took place, and a list of
(node ID, new value) pairs.

STEP
The STEP message is sent from the coordinator to all slave processors to
initiate a simulation step. The body contains the simulated time the step is
to terminate.

SETTLED
The SETTLED message is sent from a slave to the coordinator to notify it that
the slave has reached the specified termination time. The body contains the
partition number of the sender.

ROLLBACK
The ROLLBACK message is sent from a slave to the coordinator and all
dependent partitions to notify them that the slave has rolled back its simulation.
The body of this message contains the partition number of the sender, and the
simulated time the partition rolled back to.

RESYNC
The RESYNC message is sent from the coordinator to all slave processors
to inform them the simulation has settled. The slave processors use this
information to reclaim the storage in the checkpoint and input lists. The
body of this message is empty.

The following group of messages implements remote file access.

FOPEN
The FOPEN message is a request from a slave to the coordinator to open the
named file on the host computer. Only one open file is allowed at any one
time. The body contains the host file name and the access mode (e.g., read
or write).

FOPEN REPLY
The FOPEN REPLY message informs the slave the requested file is open and
ready for use. The body contains a single integer reflecting the result of the
open operation: a 0 indicates a successful open, a -1 indicates an error.

FREAD
The FREAD message is a request from a slave to the coordinator to read a
block of data from the open file. The body contains the number of items
to be read, and the size of each item. A maximum of 1024 bytes may be
requested.

FREAD REPLY
The FREAD REPLY message contains the data requested by a FREAD message.
The body contains the number of items read and the data read.

FWRITE
The FWRITE message is a request from a slave to the coordinator to write a
block of data to the open file. The body contains the number of items to be
written, the size of each item, and the data to be written.

FWRITE REPLY
The FWRITE REPLY message reports the result of a FWRITE message. The
body contains an integer error value which is 0 if the write was successful,
-1 if the write failed.

FCLOSE
The FCLOSE message is a request from a slave to the coordinator to close the
opened file. No reply is necessary.


The following group of messages support the user interface. 


PRINTF 
The PRINTF message is sent by a slave processor to the coordinator to print an 
arbitrary string on the user’s console. The body contains the null terminated 
ASCII string to be printed. The coordinator prefixes the partition ID of the 


slave to the string before printing. 


GETNODE 
The GETNODE message is a request by the coordinator to obtain the current 
value of a given node from a slave. The body contains the global ID of the 


node. 


GETNODE REPLY 
The GETNODE REPLY message is the reply from a slave to the coordinator to 


a GETNODE request. The body contains the value of the requested node. 


NODE-INFO 
The NODE-INFO message is a request by the coordinator to obtain connectiv- 
ity information about a node within the network. This message is originally 
sent to the slave responsible for driving the node. This slave prints its rel- 
evant information for the user (via PRINTF messages), and then forwards 
the NODE-INFO message to any adjacent partitions. Each adjacent partition 
sends its information directly back to the coordinator in the form of PRINTF 
messages, and then replies to the forwarding slave. When all adjacent parti- 
tions have replied, the forwarding slave replies to the coordinator. The body 
of the NODE-INFO message contains the global ID of the requested node and 


the type of information requested. 


NODE-INFO REPLY 


The NODE-INFO REPLY message is sent by a slave partition to the processor 
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which requested NODE-INFO, after all of the information has been printed. 


The body contains the partition ID. 


TRACE-NODE
The TRACE-NODE message is sent by the coordinator to enable activity tracing
for a particular node. The body contains the global ID of the node to
be traced. The receiving partition sets a flag in the specified node to enable
tracing. Whenever a traced node changes value, a notice is printed on the
user’s console.
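
A minimal sketch of the per-node trace flag follows. The node structure, the flag value and the notice format are assumptions made for illustration; in PRSIM the notice would reach the console through a PRINTF message rather than a direct printf call.

```c
#include <assert.h>
#include <stdio.h>

#define TRACED 0x01     /* hypothetical flag bit; the real value is not given */

struct tnode {          /* hypothetical, minimal stand-in for a PRSIM node */
    const char *name;
    int flags;
    int value;
};

/* Effect of TRACE-NODE and UNTRACE-NODE on the receiving partition. */
void trace_node(struct tnode *n)   { assert(n != NULL); n->flags |= TRACED; }
void untrace_node(struct tnode *n) { assert(n != NULL); n->flags &= ~TRACED; }

/* Stores a new value; returns 1 if a trace notice was printed. */
int set_value(struct tnode *n, int v)
{
    assert(n != NULL);
    if (n->value == v)
        return 0;               /* no change, so nothing to report */
    n->value = v;
    if (n->flags & TRACED) {
        printf("%s = %d\n", n->name, v);
        return 1;
    }
    return 0;
}
```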


UNTRACE-NODE
The UNTRACE-NODE message is sent by the coordinator to cancel activity
tracing for a particular node. The body contains the global ID of the node.


GETNAME
The GETNAME message is sent by a slave processor to the coordinator to
request the ASCII name of a given node. The body contains the global ID of
the node. This message is used when printing node information on the user’s
console.


GETNAME REPLY
The GETNAME REPLY message is the coordinator’s reply to the GETNAME
message. The body contains a null-terminated ASCII string representing the
name of the requested node.


DEBUG-LEVEL
The DEBUG-LEVEL message is sent from the coordinator to all slave processors
to set the debug level. The value in the body determines the type and quantity
of debugging information to display. There is no reply.




ENABLE-PROFILE
The ENABLE-PROFILE message is sent from the coordinator to all slave
processors to enable the performance monitoring software. The body is empty,
and there is no reply.


DISABLE-PROFILE
The DISABLE-PROFILE message is sent from the coordinator to all slave
processors to disable the performance monitoring software. The body is empty,
and there is no reply.
Appendix B


History Implementation 


/* This file contains the implementation of the history
 * maintenance and roll back mechanisms of PRSIM.
 * A few globally defined structures are reproduced below.
 */

/* Useful data structures */

struct Event                        /* the structure of an event */
  { evptr flink, blink;             /* doubly-linked event list */
    nptr enode;                     /* node this event is all about */
    long ntime;                     /* time, in DELTAs, of this event */
    char eval;                      /* new value */
    char type;                      /* type of event */
  };

struct Evckpt                       /* the structure of a checkpoint */
  { ckptr flink, blink;             /* doubly-linked checkpoint list */
    long ctime;                     /* time checkpoint was taken */
    int ev_index;                   /* index into event array */
    struct Event *event[TSIZE];     /* copy of event array */
    struct Event *overflow;         /* copy of overflow event list */
    char *svect;                    /* pointer to node state table */
  };

struct Input                        /* linked list of inputs */
  { iptr next;                      /* next element of list */
    nptr inode;                     /* pointer to this input node */
  };

/* For convenience, pointers are abbreviated as follows */
typedef struct Event *evptr;        /* event pointer */
typedef struct Evckpt *ckptr;       /* checkpoint pointer */
typedef struct Input *iptr;         /* input pointer */

/* Routine to checkpoint the state of the simulation
 * Note that the checkpointed event array & overflow list are stored more
 * compactly than the originals.
 */

checkpoint()
  { register ckptr ctmp;
    register evptr etmp, ev, ev_base;
    register int i, j;
    char *ptr;

    /* get a ckpt structure from free list, allocating more if necessary */
    if ((ctmp = ck_free) == NULL)
      { ctmp = (ckptr)al_bytes(10 * sizeof(struct Evckpt));
        ptr = (char *)al_bytes(nums * 10);
        for (i = 10; --i > 0; ctmp++)
          { ctmp->flink = ck_free;
            ck_free = ctmp;
            ctmp->svect = ptr;
            ptr += nums;
          }
        ctmp->svect = ptr;
      }
    else ck_free = ctmp->flink;

    /* add new ckpt struct to list of checkpoints */
    ctmp->flink = &ck_list;
    ctmp->blink = ck_list.blink;
    ck_list.blink->flink = ctmp;
    ck_list.blink = ctmp;

    /* copy event array into ckpt struct */
    for (i = 0; i < TSIZE; i++)             /* loop over lists in array */
      { ev_base = &ev_array[i];
        ev = ev_base;
        ctmp->event[i] = NULL;
        if (ev->flink == ev)                /* if it's empty, do nothing */
          continue;
        while ((ev = ev->flink) != ev_base) /* loop over each event in list */
          /* allocate event struct */
          { if ((etmp = evfree) == NULL)
              { etmp = (evptr)al_bytes(10 * sizeof(struct Event));
                for (j = 10; --j > 0; etmp++)
                  { etmp->flink = evfree;
                    evfree = etmp;
                  }
              }
            else evfree = etmp->flink;

            /* copy contents of old (ev) to new event */
            etmp->enode = ev->enode;
            etmp->ntime = ev->ntime;
            etmp->eval = ev->eval;
            etmp->type = ev->type;

            /* add new event to checkpoint event array */
            if (ctmp->event[i] == NULL)
              etmp->flink = etmp->blink = ctmp->event[i] = etmp;
            else
              { etmp->flink = ctmp->event[i];
                etmp->blink = ctmp->event[i]->blink;
                ctmp->event[i]->blink->flink = etmp;
                ctmp->event[i]->blink = etmp;
              }
          }
      }

    /* copy overflow array into ckpt struct */
    ev = &overflow;
    ctmp->overflow = NULL;
    if (ev->flink != ev)
      while ((ev = ev->flink) != &overflow)
        /* allocate event structure */
        { if ((etmp = evfree) == NULL)
            { etmp = (evptr)al_bytes(10 * sizeof(struct Event));
              for (j = 10; --j > 0; etmp++)
                { etmp->flink = evfree;
                  evfree = etmp;
                }
            }
          else evfree = etmp->flink;

          /* copy contents of old (ev) to new event */
          etmp->enode = ev->enode;
          etmp->ntime = ev->ntime;
          etmp->eval = ev->eval;
          etmp->type = ev->type;

          /* add new event to checkpoint overflow list */
          if (ctmp->overflow == NULL)
            etmp->flink = etmp->blink = ctmp->overflow = etmp;
          else
            { etmp->flink = ctmp->overflow;
              etmp->blink = ctmp->overflow->blink;
              ctmp->overflow->blink->flink = etmp;
              ctmp->overflow->blink = etmp;
            }
        }

    /* fill out rest of checkpoint struct */
    ctmp->ctime = cur_delta;        /* time stamp of checkpoint */
    ctmp->ev_index = ev_index;      /* place in event array */
    checkpt_nodes(ctmp);            /* go get node values, too */
    last_ck = cur_delta;            /* remember that we checkpointed */
  }


/* Routine to checkpoint the state of the nodes.
 * Walks the network, copying each node value & the state of the
 * INPUT flag into ctmp svect array, two nodes per byte.
 * Argument is a pointer to the checkpoint structure.
 */

checkpt_nodes(ctmp)
  register ckptr ctmp;
  { register nptr n;
    register int i, vindex = 0;
    register char nib = 0, curbyte;

    for (i = 0, vindex = 0; i < HASHSIZE; i++)
      for (n = hash[i]; n; n = n->hnext)
        { if (nib == 0)             /* even nodes in low nibble */
            { nib++;
              curbyte = n->npot;
              if (n->nflags & INPUT) curbyte |= 0x04;
            }
          else                      /* odd nodes in high nibble */
            { nib = 0;
              curbyte |= (n->npot << 4);
              if (n->nflags & INPUT) curbyte |= 0x40;
              ctmp->svect[vindex] = curbyte;
              vindex++;
            }
        }

    /* an odd node count leaves a half-filled byte to store */
    if (nib) ctmp->svect[vindex] = curbyte;
  }


/* Routine to restore the state of the nodes from a checkpoint.
 * Walks the network, copying each node value & the state of the
 * INPUT flag from ctmp svect array.
 * Argument is a pointer to the checkpoint structure.
 */

restore_nodes(ctmp)
  register ckptr ctmp;
  { register nptr n;
    register int i, vindex = 0;
    register char nib = 0, curbyte;

    for (i = 0, vindex = 0; i < HASHSIZE; i++)
      for (n = hash[i]; n; n = n->hnext)
        { curbyte = ctmp->svect[vindex];
          if (nib)                  /* odd nodes in high nibble */
            { nib = 0;
              n->ev1 = n->ev2 = NULL;
              n->npot = ((curbyte >> 4) & 0x03);
              if (curbyte & 0x40) n->nflags |= INPUT;
              else n->nflags &= ~INPUT;
              vindex++;
            }
          else                      /* even nodes in low nibble */
            { nib++;
              n->ev1 = n->ev2 = NULL;
              n->npot = (curbyte & 0x03);
              if (curbyte & 0x04) n->nflags |= INPUT;
              else n->nflags &= ~INPUT;
            }
        }
  }


/* Roll the simulation back to a time before t and restore the state
 * from event checkpoint and node history lists
 */

roll_back(t)
  register long t;
  { register ckptr ctmp;
    register int i, j;
    register evptr ev, etmp, ev_base;
    ckptr nctmp;
    int nevents = 0;
    int oevents = 0;

    /* find closest checkpoint to the roll-back time */
    ctmp = ck_list.blink;
    while (ctmp->ctime > t)
      if (ctmp->blink == &ck_list)
        { error("; roll_back: can't go back to %ld", t);
          return 0;
        }
      else ctmp = ctmp->blink;

    /* tell everyone who cares that we're rollin' back */
    rollback_notify(ctmp->ctime);

    /* walk the network restoring node values */
    restore_nodes(ctmp);

    /* restore event array & overflow list, simulated time */
    for (i = 0; i < TSIZE; i++)
      /* free up old current events */
      { ev_base = &ev_array[i];
        if (ev_base->flink != ev_base)
          { ev_base->blink->flink = evfree;
            evfree = ev_base->flink;
            ev_base->flink = ev_base->blink = ev_base;
          }
        /* make a copy of this event list, if there is one */
        if (ctmp->event[i] != NULL)
          { ev = ctmp->event[i];
            do
              /* allocate event struct */
              { if ((etmp = evfree) == NULL)
                  { etmp = (evptr)al_bytes(10 * sizeof(struct Event));
                    for (j = 10; --j > 0; etmp++)
                      { etmp->flink = evfree;
                        evfree = etmp;
                      }
                  }
                else evfree = etmp->flink;
                /* Copy event data into new event */
                etmp->enode = ev->enode;
                etmp->ntime = ev->ntime;
                etmp->eval = ev->eval;
                etmp->type = ev->type;
                etmp->flink = ev_base;
                etmp->blink = ev_base->blink;
                ev_base->blink->flink = etmp;
                ev_base->blink = etmp;
                /* link nodes to events */
                if (ev->type == 0)
                  etmp->enode->ev1 = etmp;
                else if (ev->type == 1)
                  etmp->enode->ev2 = etmp;
              }
            while ((ev = ev->flink) != ctmp->event[i]);
          }
      }

    /* restore pointer into event array */
    ev_index = ctmp->ev_index;

    /* free up current overflow events */
    if (overflow.flink != &overflow)
      { overflow.blink->flink = evfree;
        evfree = overflow.flink;
        overflow.flink = overflow.blink = &overflow;
      }

    /* make a copy of this event list, if there is one */
    if (ctmp->overflow != NULL)
      { ev = ctmp->overflow;
        do
          /* allocate event struct */
          { if ((etmp = evfree) == NULL)
              { etmp = (evptr)al_bytes(10 * sizeof(struct Event));
                for (j = 10; --j > 0; etmp++)
                  { etmp->flink = evfree;
                    evfree = etmp;
                  }
              }
            else evfree = etmp->flink;
            /* Copy event data into new event */
            etmp->enode = ev->enode;
            etmp->ntime = ev->ntime;
            etmp->eval = ev->eval;
            etmp->type = ev->type;
            etmp->flink = &overflow;
            etmp->blink = overflow.blink;
            overflow.blink->flink = etmp;
            overflow.blink = etmp;
            /* link nodes to events */
            if (ev->type == 0)
              etmp->enode->ev1 = etmp;
            else if (ev->type == 1)
              etmp->enode->ev2 = etmp;
          }
        while ((ev = ev->flink) != ctmp->overflow);
      }

    /* restore current simulated time, and remember there's a
     * good checkpoint here
     */
    cur_delta = ctmp->ctime;
    last_ck = cur_delta;

    /* back up input list */
    while ((cur_input->ntime >= cur_delta) && (cur_input != &inlist))
      cur_input = cur_input->blink;

    /* garbage collect old checkpoints */
    if (ctmp->flink == &ck_list)
      return;                           /* nothing to collect */
    nctmp = ctmp->flink;                /* remember next struct in list */
    ctmp->flink = &ck_list;             /* make last struct point to end */
    ck_list.blink->flink = ck_free;     /* old end points to free list */
    ck_list.blink = ctmp;               /* ... and end point to it */
    ctmp = nctmp;
    while (ctmp != ck_free)             /* now collect events inside */
      { for (i = 0; i < TSIZE; i++)
          { if ((ev = ctmp->event[i]) == NULL) continue;
            ev->blink->flink = evfree;
            evfree = ev;
            ctmp->event[i] = NULL;
          }
        if ((ev = ctmp->overflow) != NULL)
          { ev->blink->flink = evfree;
            evfree = ev;
            ctmp->overflow = NULL;
          }
        ctmp = ctmp->flink;
      }
    ck_free = nctmp;
  }


/* Clean up and dispose of ancient history properly
 * We walk the checkpoint list, reclaiming all events inside, and
 * then reclaim the checkpoint list itself.
 * We then move all input changes en masse to the free list.
 * Finally, we take a new checkpoint, just for fun.
 */

cleanup_hist()
  { register ckptr ctmp, nctmp;
    register evptr etmp, ev;
    register int i;

    /* free up all checkpoint structures
     * for each checkpoint, we must first free up all event
     * structures
     */
    ctmp = ck_list.flink;
    while (ctmp != &ck_list)
      { for (i = 0; i < TSIZE; i++)
          { if ((etmp = ctmp->event[i]) == NULL) continue;
            ev = etmp;
            etmp->blink->flink = evfree;
            evfree = etmp;
            ctmp->event[i] = NULL;
          }
        if ((etmp = ctmp->overflow) != NULL)
          { ev = etmp;
            etmp->blink->flink = evfree;
            evfree = etmp;
            ctmp->overflow = NULL;
          }
        nctmp = ctmp->flink;
        ctmp->flink = ck_free;
        ck_free = ctmp;
        ctmp = nctmp;
      }
    ck_list.flink = ck_list.blink = &ck_list;
    last_ck = 0;

    /* flush old input changes (inputs on inlist before current time) */
    ev = &inlist;
    cur_input->blink->flink = evfree;
    cur_input->blink = &inlist;
    evfree = inlist.flink;
    inlist.flink = cur_input;

    /* so we can roll back to here if need be */
    checkpoint();
  }
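
The two-nodes-per-byte layout used by checkpt_nodes and restore_nodes can be exercised in isolation. The sketch below is not part of PRSIM: it replaces the hash-table walk with plain arrays, and the names pack_nodes and unpack_nodes are hypothetical. It keeps the same bit assignments as the listing above (two value bits per nibble, with 0x04 and 0x40 marking INPUT).

```c
#include <assert.h>

#define INPUT_LO 0x04   /* INPUT marker bit within the low nibble  */
#define INPUT_HI 0x40   /* INPUT marker bit within the high nibble */

/* Pack n node states into svect, two nodes per byte. pot[i] must lie in
 * 0..3 and is_input[i] is zero or nonzero. Returns the number of bytes
 * written, which is (n + 1) / 2. */
int pack_nodes(const unsigned char *pot, const unsigned char *is_input,
               int n, unsigned char *svect)
{
    int i, vindex = 0;
    unsigned char cur = 0;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        assert(pot[i] <= 3);
        if ((i & 1) == 0) {             /* even nodes in low nibble */
            cur = pot[i];
            if (is_input[i]) cur |= INPUT_LO;
        } else {                        /* odd nodes in high nibble */
            cur |= (unsigned char)(pot[i] << 4);
            if (is_input[i]) cur |= INPUT_HI;
            svect[vindex++] = cur;
        }
    }
    if (n & 1)                          /* flush a half-filled last byte */
        svect[vindex++] = cur;
    return vindex;
}

/* Inverse of pack_nodes: recover n values and INPUT flags from svect. */
void unpack_nodes(const unsigned char *svect, int n,
                  unsigned char *pot, unsigned char *is_input)
{
    int i;

    for (i = 0; i < n; i++) {
        unsigned char b = svect[i / 2];
        if (i & 1) b >>= 4;             /* move the high nibble down */
        pot[i] = b & 0x03;
        is_input[i] = (b & 0x04) != 0;
    }
}
```

Packing the three nodes (1, not input), (2, input) and (3, input) yields the bytes 0x61 and 0x07, and unpacking returns the original values.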


Appendix C 


Raw Performance Data 


The following table contains the raw performance data from the experiments
described in Section 4.2. The first column contains the name of the test vector;
the first component of the name indicates the vector length. The second column
contains the number of effective events generated for that vector. The remaining
five columns contain the number of clock ticks (16.2 mSec/tick) per vector for
each experiment.
2A 1729 2242
2B 1807 2323 
2C 1661 2105 
2D 2022 2662 
4A 3678 4979 
4B 4000 5395 
4C 3977 5233 
4D 4045 5541 
6A 6714 9425 
6B 4510 6345 
6C 7196 9761 
6D 6573 9347 
8A 8414 11714 
8B 8535 12339 
8C 8140 11706 
8D 7817 11652 
10A NA 16925 
10B NA 16182 
10C NA 12343 
10D NA 13287 
12A 13981 20453 
12B 14123 20285 
12C 10636 15928 
12D 11438 17328 
14A NA 21592 
14B NA 18348 
14C NA 22572 
14D NA 21195 
16A 16190 24972 
16B 17070 26500 
16C 20539 31872 
16D 14779 22609 
24A 25009 39719 
24B 21501 34636 
24C 29648 46341 
24D 24793 39983 


Simulation Time per Vector 
Appendix D


Profiling Data 


The following tables contain the raw profiling data for the six-partition
experiment. The first six tables contain the data for each separate partition,
while the last table contains the aggregate sum. The data in each column are as
follows:

1. The name of the subroutine.

2. The total time spent in each subroutine, measured in units of clock ticks
   (16.2 mSec per tick).

3. The total number of calls to each subroutine.

4. The average time spent in each call. This is the quotient of the total
   time (expressed in mSec.) divided by the number of calls.

5. The percentage of the total time that was spent in each subroutine.

6. The percentage of the active simulation time spent in each subroutine.
   The active simulation time is the total time minus the idle time (time
   spent in step).
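
The derived columns follow directly from the tick counts. A minimal sketch of the arithmetic is given below; the function names are hypothetical, and reporting 0.000 for routines with zero recorded calls is an assumption chosen to match the tables.

```c
#include <assert.h>

#define MSEC_PER_TICK 16.2      /* clock tick length given above */

/* Average time per call in mSec; routines with no recorded calls
 * are reported as 0.000. */
double msec_per_call(long ticks, long calls)
{
    assert(ticks >= 0 && calls >= 0);
    return calls > 0 ? ticks * MSEC_PER_TICK / calls : 0.0;
}

/* Percentage of the total run time spent in a subroutine. */
double pct_total(long ticks, long total_ticks)
{
    assert(total_ticks > 0);
    return 100.0 * ticks / total_ticks;
}

/* Percentage of the active time, i.e. the total minus the idle time
 * (the ticks charged to step). */
double pct_active(long ticks, long total_ticks, long idle_ticks)
{
    assert(total_ticks > idle_ticks);
    return 100.0 * ticks / (total_ticks - idle_ticks);
}
```

For example, the step entry for Partition 4 (70274 ticks, 51 calls) gives 70274 x 16.2 / 51 = 22322.329 mSec per call, which agrees with the tabulated value.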


Subroutines with a “0” number of calls are library routines which were not recom-
Subroutine          Ticks No. Calls  mSec/Call  % Total  % Active

step               172405        51  54763.941    50.04      0.00
qldiv               66391   6566045      0.164              38.56
c_thev              19665    659956      0.483              11.42
cvtcond             14475   4892172      0.048               8.41
Handler             10073   5011287      0.033               5.85
lqmul                9923   3144616      0.051               5.76
muldiv               7601   2925075      0.042               4.42
sim_step             6376        51   2025.318               3.70
msg_poll             6080         0      0.000               3.53
new_val              4918    199081      0.400               2.86
main                 4781         0      0.000               2.78
checkpt_nodes        3991      3226     20.042               2.32
lmul                 3823         0      0.000               2.22
enque                3143    335994      0.152               1.83
make_clist           2674    199081      0.218               1.55
setin                2225     16004      2.252               1.29
check_inputs         1330    376813      0.057               0.77
uldiv                1158         0      0.000               0.67
checkpoint                     3226      5.233               0.61
cshare_make_clist             25976      0.393               0.37
remul                        219541      0.044               0.34
charge_share                  25976      0.314               0.29
cleanup_hist                     51    140.400               0.26
lrem                              0      0.000               0.05
check_overflow                30776      0.038               0.04
find                          16004      0.061
msg_handler                   16208      0.058
node_change                    1599      0.091
sbrk                              0      0.000
msg_free                        399      0.325
msg_cons                       1650      0.059
LOSend                            0      0.000
msg_send                       1650      0.039
malloc                            0      0.000
msg_alloc                      1650      0.020


Profiling Data for Partition # 1
Subroutine          Ticks No. Calls  mSec/Call

step               174648        51  55476.424
qldiv               63293   6276652      0.163
c_thev              18164    632654      0.465
cvtcond             13401   4669912      0.046
Handler             10086   5135188      0.032
lqmul                9543   3004391      0.051
checkpt_nodes        7787      6346     19.879
muldiv               6949   2785478      0.040
sim_step             6332       316    324.615
msg_poll             6124         0      0.000
lmul                 5614         0      0.000
new_val              4819    221392      0.353
main                 4626         0      0.000
enque                3201    351800      0.147
make_clist           2694    221392      0.197
setin                2363     17603      2.175
checkpoint           2019      6346      5.154
check_inputs         1395    398133      0.057
uldiv                1064         0      0.000
cleanup_hist          877        51    278.576
cshare_make_clist     599      2505      3.874
remul                 531    218913      0.039
charge_share          479     25050      0.310
restore_nodes         392       265     23.964
roll_back                       265      6.113
lrem                              0      0.000
check_overflow                34899      0.035
find                          17603      0.054
msg_handler                   17794      0.045
sbrk                              0      0.000
msg_alloc                       902      0.108
msg_free                        198      0.491
malloc                            0      0.000
node_change                     576      0.084
msg_cons                        902      0.036
LOSend                            0      0.000
msg_send                        902      0.018
settled                          61      0.266


Profiling Data for Partition # 2
Subroutine         No. Calls  mSec/Call  % Total  % Active

qldiv                9587394      0.162    29.18
step                      51  30326.718    28.99
c_thev                959099      0.467     8.39
cvtcond              7148518      0.047     6.32
lqmul                4595117      0.051     4.42
muldiv               4273260      0.041     3.25
Handler              4938794      0.032     2.95
sim_step                  51   2550.706     2.44
new_val               296440      0.369     2.05
msg_poll                   0      0.000     1.78
setin                  22980      3.358     1.45
checkpt_nodes           3406     22.245     1.42
enque                 498914      0.148     1.39      1.95
main                       0      0.000     1.32      1.86
lmul                       0      0.000     1.27      1.79
make_clist            296440      0.200     1.11      1.57
uldiv                      0      0.000     0.49      0.69
check_inputs          426232      0.054     0.43      0.61
checkpoint              3406      5.232     0.33      0.47
cshare_make_clist      39123      0.402     0.29      0.41
remul                 321857      0.041     0.25      0.35
charge_share           39123      0.283     0.21      0.29
cleanup_hist              51    149.929     0.14      0.20
lrem                       0      0.000     0.03      0.04
check_overflow         35256      0.046     0.03      0.04
find                   22980      0.061     0.03      0.04
msg_handler            23140      0.043     0.02      0.03
sbrk                       0      0.000     0.00      0.01
node_change              641      0.152     0.00      0.00
msg_free                 159      0.306     0.00      0.00
malloc                     0      0.000     0.00      0.00
msg_cons                 692      0.047     0.00      0.00
msg_alloc                692      0.047     0.00
settled                   51      0.635     0.00
msg_send                 692      0.000     0.00


Profiling Data for Partition # 3


Subroutine          Ticks No. Calls  mSec/Call  % Total  % Active

qldiv              105704  10520809      0.163    32.59     41.61
step                70274        51  22322.329    21.67      0.00
c_thev              30744   1052866      0.473     9.48     12.10
cvtcond             22494   7842552      0.046     6.94      8.86
lqmul               16116   5041376      0.052     4.97      6.34
muldiv              11559   4693488      0.040     3.56      4.55
Handler              9736   4948014      0.032     3.00      3.83
sim_step             8473       103   1332.647     2.61      3.34
new_val              7236    310824      0.377     2.23      2.85
msg_poll             5804         0      0.000     1.79      2.28
checkpt_nodes        5354      3887     22.314     1.65      2.11
enque                5021    532620      0.153     1.55      1.98
setin                4585     23045      3.223     1.41      1.80
lmul                 4560         0      0.000     1.41      1.80
main                 4493         0      0.000     1.39      1.77
make_clist           3866    310824      0.201     1.19      1.52
uldiv                1761         0      0.000     0.54      0.69
check_inputs         1501    444225      0.055     0.46      0.59
checkpoint           1341      3887      5.589     0.41      0.53
cshare_make_clist    1035     42618      0.393     0.32      0.41
remul                 850    347888      0.040     0.26      0.33
charge_share          712     42618      0.271     0.22      0.28
cleanup_hist          536        51    170.259     0.17      0.21
lrem                  106         0      0.000     0.03      0.04
restore_nodes          93        52     28.973     0.03      0.04
check_overflow         92     36748      0.041     0.03      0.04
msg_handler            84     23210      0.059     0.03      0.03
find                   77     23045      0.054     0.02      0.03
sbrk                   25         0      0.000     0.01      0.01
roll_back              19        52      5.919     0.01      0.01
node_change            12      1341      0.145     0.00      0.00
msg_alloc               8      1444      0.090     0.00      0.00
msg_free                8       335      0.387     0.00      0.00
msg_send                7      1444      0.079     0.00      0.00
LOSend                  6         0      0.000     0.00      0.00
msg_cons                2      1444      0.022     0.00      0.00
al_bytes                2       167      0.194     0.00      0.00
rollback_notify         1        52      0.312     0.00      0.00
malloc                  1        51      0.318     0.00      0.00
settled                 0        51      0.000     0.00      0.00


Profiling Data for Partition # 4 
Subroutine          Ticks No. Calls  mSec/Call

qldiv              112574  11195769      0.163
step                52083        51  16544.012
c_thev              32214   1114184      0.468
cvtcond             24011   8383728      0.046
lqmul               16919   5354395      0.051
muldiv              12327   4989152      0.040
Handler              9915   4954787      0.032
sim_step             8676       113   1243.816
new_val              7603    323078      0.381
msg_poll             5914         0      0.000
checkpt_nodes        5521      4029     22.199
enque                5213    562673      0.150
lmul                 4811         0      0.000
setin                4772     23745      3.256
main                 4361         0      0.000
make_clist           4195    323078      0.210
uldiv                1820         0      0.000
check_inputs         4557    455497      0.162
checkpoint           1365      4029      5.488
cshare_make_clist    1171     44897      0.423
remul                 967    365243      0.043
charge_share          848     44897      0.306
cleanup_hist          545        51    173.118
find                          23745      0.082
restore_nodes                    62     28.219
msg_handler                   23908      0.068
lrem                              0      0.000
check_overflow                38553      0.033
roll_back                        62      8.100
sbrk                              0      0.000
node_change                    1281      0.177
msg_send                       1394      0.081
LOSend                            0      0.000
msg_alloc                      1394      0.070
msg_free                        341      0.238
msg_cons                       1394      0.035
malloc                            0      0.000
rollback_notify                  62      0.261
al_bytes                        175      0.000
settled                          51      0.000


Profiling Data for Partition # 5


Subroutine          Ticks No. Calls  mSec/Call  % Total

qldiv              125032  12343811      0.164    40.25
c_thev              37045   1224108      0.490    11.92
cvtcond             27094   9248812      0.047     8.72
lqmul               19018   5918752      0.052     6.12
muldiv              14468   5515133      0.042     4.66
step                11799        51   3747.918     3.80
Handler             10000   4811223      0.034     3.22
sim_step             9626       151   1032.723     3.10
new_val              8625    348289      0.401     2.78
checkpt_nodes        6478      4680     22.424     2.09
msg_poll             5816         0      0.000     1.87
enque                5771    610069      0.153     1.86
lmul                 5560         0      0.000     1.79
setin                4942     23685      3.380     1.59
make_clist           4741    348289      0.221     1.53
main                 4628         0      0.000     1.49
uldiv                2089         0      0.000     0.67
check_inputs         1721    476487      0.059     0.55
checkpoint           1570      4680      5.435     0.51
cshare_make_clist    1346     49596      0.440     0.43
remul                1092    403619      0.044     0.35
charge_share          893     49596      0.292     0.29
cleanup_hist          662        51    210.282     0.21
restore_nodes         188       100     30.456     0.06
lrem                  112         0      0.000     0.04
msg_handler           103     23851      0.070     0.03
find                   92     23685      0.063     0.03
check_overflow         75     40820      0.030     0.02
roll_back              34       100      5.508     0.01
sbrk                   30         0      0.000     0.01
malloc                  2         0      0.000     0.00
al_bytes                1       197      0.082     0.00
msg_cons                        151      0.107     0.00
msg_alloc                       151      0.107     0.00
msg_send                        151      0.000     0.00
rollback_notify                 100      0.000     0.00
settled                          51      0.000     0.00
msg_free                         30      0.000     0.00


Profiling Data for Partition # 6
Subroutine          Ticks No. Calls  mSec/Call  % Total  % Active

step               576682            30530.224    29.14      0.00
qldiv              569089  56490480      0.163
c_thev             165475   5642867      0.475
cvtcond            122282  42185694      0.047
lqmul               86084  27058647      0.052
muldiv              63593  25181586      0.041
Handler             59530  29799293      0.032
sim_step            47513       785    980.523
new_val             39949   1699104      0.381
msg_poll            35609         0      0.000
checkpt_nodes       33808     25574     21.416
lmul                28548         0      0.000
main                27249         0      0.000
enque               26913   2892070      0.151
setin               23650    127062      3.015
make_clist          21832   1699104      0.208
check_inputs        11926   2577387      0.075
uldiv                9505         0      0.000
checkpoint           8437     25574      5.344
cshare_make_clist    5751    204715      0.455
remul                4844   1877061      0.042
charge_share         4118    227260      0.294
cleanup_hist         3534       306    187.094
restore_nodes         781       479     26.414
lrem                  598         0      0.000
check_overflow        494    217052      0.037
find                  494    127062      0.063
msg_handler           456    128111      0.058
roll_back             184       479      6.223
sbrk                  141         0      0.000
node_change            44      5438      0.131
msg_free                       1462      0.332
msg_alloc                      6233      0.065
LOSend                            0      0.000
msg_send                       6233      0.049
msg_cons                       6233      0.042
malloc                           51      5.082
al_bytes                        539      0.090
rollback_notify                 214      0.151